PDF documents and PDF generators are ubiquitous on the web, and so are injection vulnerabilities. Did you know that controlling a measly HTTP hyperlink can provide a foothold into the inner workings of a PDF? In this paper, you will learn how to use a single link to compromise the contents of a PDF and exfiltrate it to a remote server, just like a blind XSS attack.
This whitepaper is also available as a printable PDF, and as a «director’s cut» edition of a presentation premiered at Black Hat Europe 2020:
It all started when my colleague, James «albinowax» Kettle, was watching a talk on PDF encryption at Black Hat. He was looking at the slides and thought «This is definitely injectable». When he got back to the office, we had a discussion about PDF injection. At first, I dismissed it as impossible. You wouldn’t know the structure of the PDF and, therefore, wouldn’t be able to inject the correct object references. In theory, you could do this by injecting a whole new xref table, but this won’t work in practice as your new table will simply be ignored… Here at PortSwigger, we don’t stop there; we might initially think an idea is impossible, but that won’t stop us from trying.
How can user input get inside PDFs?
Server-side PDF generation is everywhere; it’s in e-tickets, receipts, boarding passes, invoices, pay slips…the list goes on. So there’s plenty of opportunity for user input to get inside a PDF document. The most likely targets for injection are text streams or annotations as these objects allow developers to embed text or a URI, enclosed within parentheses. If a malicious user can inject parentheses, then they can inject PDF code and potentially insert their own harmful PDF objects or actions.
Why try to inject PDF code?
Consider an application where multiple users work on a shared PDF containing sensitive information, such as bank details. If you are able to control part of that PDF via an injection, you could potentially exfiltrate the entire contents of the file when another user accesses it or interacts with it in some way. This works just like a classic XSS attack but within the scope of a PDF document.
Why can’t you inject arbitrary content?
I have devised the following methodology for PDF injection: Identify, Construct, and Exploit.
First of all, you need to identify whether the PDF generation library is escaping parentheses or backslashes. You can also try to generate these characters by using multi-byte characters that contain 0x5c (backslash) or 0x29 (parenthesis) in the hope the library incorrectly converts them to single-byte characters. Another possible method of generating parentheses or backslashes is to use characters outside the ASCII range. This can cause an overflow if the library incorrectly handles the character. You should then see if you can break the PDF structure by injecting a NULL character, EOF markers, or comments.
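The identification step can be automated. The following is a minimal Python sketch, not the tooling used in this research: `render_paren_only` is a hypothetical stand-in for whatever produces the PDF, and the marker `zq` is an arbitrary choice. It wraps each probe character in the marker and reports the characters that come back unescaped.

```python
MARKER = "zq"  # arbitrary marker, assumed not to appear elsewhere in the output


def unescaped_chars(render, chars="()\\"):
    """Return the probe characters that `render` passes through unescaped."""
    found = []
    for ch in chars:
        probe = f"{MARKER}{ch}{MARKER}"
        # An escaping generator emits e.g. zq\(zq, so the raw probe is absent.
        if probe in render(probe):
            found.append(ch)
    return found


def render_paren_only(text):
    """Hypothetical generator that escapes parentheses but forgets backslashes."""
    escaped = text.replace("(", "\\(").replace(")", "\\)")
    return f"/URI ({escaped})"


print(unescaped_chars(render_paren_only))  # ['\\']
```

A generator like this one is still injectable: an input of `\)` becomes `\\)`, which is an escaped backslash followed by a live closing parenthesis. Real output may be compressed or hex-encoded, so the raw bytes may need decoding before a check like this applies.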
I tried around eight different libraries while conducting this research. Of these, I found two that were vulnerable to PDF injection: PDF-Lib and jsPDF, both of which are npm modules. PDF-Lib has over 52k weekly downloads and jsPDF has over 250k. Both libraries seem to escape text streams correctly, but both make the mistake of allowing PDF injection inside annotations. Here is an example of how you create annotations in PDF-Lib:
As you can see in the code sample, PDF-Lib has a helper function to generate PDF strings, but it doesn’t escape parentheses. So if a developer places user input inside a URI, an attacker can break out and inject their own PDF code. The other library, jsPDF, has the same problem, but this time in the url property of their annotation generation code:
The first step was to test a PDF library, so I downloaded PDFKit, created a bunch of test PDFs, and looked at the generated output. The first thing that stood out was text objects. If you have an injection inside a text stream then you can break out of the text using a closing parenthesis and inject your own PDF code.
A PDF text object looks like the following:
BT indicates the start of a text object, /F13 sets the font, 12 specifies the size, and Tf is the font resource operator (it’s worth noting that in PDF code, the operators tend to follow their parameters).
The numbers that follow Tf are the starting position on the page; the Td operator specifies the position of the text on the page using those numbers. The opening parenthesis starts the text that’s going to be added to the page, «ABC» is the actual text, then the closing parenthesis finishes the text string. Tj is the show text operator and ET ends the text object.
Controlling the characters inside the parentheses could enable us to break out of the text string and inject PDF code.
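To make the breakout concrete, here is a minimal Python sketch of a generator that builds the text object by plain string concatenation. The function name and the coordinates `288 720` are assumptions chosen for illustration; this is not PDFKit's code.

```python
def text_object(text, font="/F13", size=12, x=288, y=720):
    """Build a PDF text object with NO escaping (deliberately vulnerable)."""
    return f"BT\n{font} {size} Tf\n{x} {y} Td\n({text}) Tj\nET"


print(text_object("ABC"))

# A closing parenthesis ends the string early; the rest is parsed as PDF
# operators. The trailing "(INJECTED" is closed by the generator's own ")".
print(text_object("ABC) Tj\n(INJECTED"))
```

The second call emits two show-text operations, `(ABC) Tj` and `(INJECTED) Tj`, from a single input string.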
I tried all the techniques mentioned in my methodology with PDFKit, PDF Make, and FPDF, and got nowhere. At this point, I parked the research and did something else for a while. I often do this if I reach a dead end. It’s no good wasting time on research that is going nowhere. I find that coming back to it later with a fresh mind helps a lot. Being persistent is great, but don’t fall into the trap of being repetitive without results.
With a fresh mind, I picked up the research again and decided to study the PDF specification. Just like with XSS, PDF injections can occur in different contexts. So far, I’d only looked at text streams, but sometimes user input might get placed inside links. Annotations stood out to me because they would allow developers to create anchor-like links on PDF text and objects. By now I was on my fourth PDF library. This time, I was using PDF-Lib. I took some time to use the library to create an annotation and see if I could inject a closing parenthesis into the annotation URI, and it worked! The vulnerable sample code I used to generate the annotation was:
As you can clearly see, the injection string is closing the text boundary with a closing parenthesis, which leaves an existing closing parenthesis that causes the PDF to be rendered incorrectly:
To set a flag, you first need to look up its bit position (table 237 of the PDF specification). In this case, we want to set the SubmitPDF flag. As this is controlled by the 9th bit, you just need to count 9 bits from the right:
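The arithmetic is a single shift. As a quick Python sketch (the helper name is hypothetical), bit position 1 is the lowest-order bit, so position n has the value 2^(n-1):

```python
def flag_value(bit_position):
    """Value of a flag whose 1-based bit position is counted from the right."""
    return 1 << (bit_position - 1)


submit_pdf = flag_value(9)
print(format(submit_pdf, "b"))  # 100000000
print(submit_pdf)               # 256
```

A flags value of 256 therefore sets SubmitPDF on its own; several flags can be combined with a bitwise OR of their values.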
Next I applied my methodology to another PDF library, jsPDF, and found it was vulnerable too. Exploiting this library was quite fun because it has an API that can execute in the browser and will allow you to generate the PDF in real time as you type. I noticed that, like PDF-Lib, it forgot to escape parentheses inside annotation URLs. Here the url property was vulnerable:
So I generated a PDF using their API and injected PDF code into the url property:
I reduced the vector by removing the type entries of the dictionary and the unneeded F entry. I then left a dangling parenthesis that would be closed by the existing one. Reducing the size of the injection is important because the web application you are injecting into might only allow a limited number of characters.
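The parenthesis bookkeeping is easy to get wrong, so it can help to count it. The Python below is a rough sketch under simplifying assumptions: `uri_entry` is a hypothetical stand-in for the vulnerable generator, the payloads are placeholders rather than working vectors, and the counter only understands backslash escapes.

```python
def paren_balance(fragment):
    """Count unescaped '(' minus unescaped ')' in a PDF fragment."""
    depth = 0
    i = 0
    while i < len(fragment):
        ch = fragment[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        i += 1
    return depth


def uri_entry(user_input):
    """Hypothetical vulnerable generator: no escaping of user input."""
    return f"/URI ({user_input})"


print(paren_balance(uri_entry("/blah")))               # 0
print(paren_balance(uri_entry("/blah)")))              # -1
print(paren_balance(uri_entry("/blah) /Injected (")))  # 0
```

A result of -1 corresponds to the stray closing parenthesis that breaks rendering. Ending the payload with an opening parenthesis lets the generator's own closing parenthesis bring the count back to zero.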
Further research revealed that you can also inject multiple annotations. This means that instead of just injecting an action, you could break out of the annotation and define your own rect coordinates to choose which section of the document would be clickable. Using this technique, I was able to make the entire document clickable.
Writing an enumerator
The next stage was to look at how Acrobat handles PDFs that are loaded from the filesystem, rather than being served directly from a website. In this case, there are more restrictions in place. For example, when you try to submit a form to an external URL, this will now trigger a prompt in which the user has to manually confirm that they want to submit the form. To get around these restrictions I wrote an enumerator/fuzzer to call every function on every object to see if a function would allow me to contact an external server without user interaction.
The enumerator first runs a for loop on the global object «this». I skipped the methods getURL, submitForm, and the console object because I knew that they cause prompts and do not allow you to contact external servers unless you click allow. Try-catch blocks are used to prevent the loop from failing if an exception is thrown because the function can’t be called or the property isn’t a valid function. Burp Collaborator is used to see whether the server was contacted successfully — I add the key being checked in the subdomain so that Collaborator will show which property allowed the interaction.
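The control flow of the enumerator can be illustrated outside Acrobat. The Python below is an analogy only, not the Acrobat JavaScript used in the research: `FakeDoc`, its methods, and the domain `collab.example` are hypothetical stand-ins, and nothing here makes a network request. It shows the three ingredients described above: a skip list, exception handling around every call, and the member name placed in the subdomain.

```python
SKIP = {"getURL", "submitForm", "console"}  # known to trigger prompts


def enumerate_calls(target, domain, skip=SKIP):
    """Try to call every public member of `target` with a tagged URL."""
    results = {}
    for name in dir(target):
        if name.startswith("_") or name in skip:
            continue
        url = f"http://{name}.{domain}/"  # name in subdomain identifies the hit
        try:
            getattr(target, name)(url)
            results[name] = "called"
        except Exception as exc:  # not callable, wrong arguments, etc.
            results[name] = f"error: {type(exc).__name__}"
    return results


class FakeDoc:
    """Hypothetical stand-in for the document object."""
    title = "not a function"

    def __init__(self):
        self.contacted = []

    def getURL(self, url):
        raise RuntimeError("would prompt the user")

    def openReview(self, url):
        self.contacted.append(url)


doc = FakeDoc()
print(enumerate_calls(doc, "collab.example"))
print(doc.contacted)
```

Members that are not callable are recorded as errors instead of stopping the loop, and the only URL that reaches the stand-in carries the name of the method that accepted it.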
Using this fuzzer, I discovered a method that can be called that contacts an external server: CBSharedReviewIfOfflineDialog will cause a DNS interaction without requiring the user to click allow. You could then use DNS to exfiltrate the contents of the PDF or other information. However, this still requires a click since our injection uses an annotation action.
Executing annotations without interaction
So far, the vectors I’ve demonstrated require a click to activate the action from the annotation. True to form, James asked the question «Can we execute automatically?». I looked through the PDF specification and noticed some interesting features of annotations:
«The PV and PI entries allow a distinction between pages that are open and pages that are visible. At any one time, only a single page is considered open in the viewer application, while more than one page may be visible, depending on the page layout.»
We can add the PV entry to the dictionary and the annotation will fire on Acrobat automatically! Not only that, but we can also execute a payload automatically when the PDF document is closed using the PC entry. An attacker could track you when you open the PDF and close it.
Here’s how to execute automatically from an annotation:
When you close the PDF, this annotation will fire:
As you can see, the above vector requires knowledge of the PDF structure. [ 3 0 R] refers to a specific PDF object and if we were doing a blind PDF injection attack, we wouldn’t know the structure of it. Still, the next stage is to try a form submission. We can use the submitForm function for this, and because the annotation requires a click, Chrome will allow it:
This works, but it’s messy and requires knowledge of the PDF structure. We can reduce it a lot and remove the reliance on the PDF structure:
There’s still some code we can remove:
Next I looked at the submitForm function to steal the contents of the PDF. We know that we can call the function and it does contact an external server, as demonstrated in one of the examples above, but does it support the full Acrobat specification? I looked at the source code of PDFium, but the function doesn’t support SubmitAsPDF. You can see it supports FDF, but unfortunately this doesn’t submit the contents of the PDF. I looked for other ways, but I didn’t know what objects were available. I took the same approach I did with Acrobat and wrote a fuzzer/enumerator to find interesting objects. Getting information out of Chrome was more difficult than with Acrobat; because the alert function truncated the string sent to it, I had to gather information in chunks before outputting it.
Inspecting the output of the enumerator, I tried calling various functions in the hope of making external requests or gathering information from the PDF. Eventually, I found a very interesting function called getPageNthWord, which could extract words from the PDF document, thereby allowing me to steal the contents. The function has a subtle bug where the first word is sometimes not extracted, but it will extract the majority of words:
SSRF in PDFium/Acrobat
You can even send raw new lines, which could be useful when chaining other attacks such as request smuggling. The result of the POST request can be seen in the following Collaborator request:
PDF upload «formcalc» technique
If you are writing a PDF library, it’s recommended that you escape parentheses and backslashes when accepting user input within text streams or annotation URIs. As a developer, you can use the injections mentioned in this paper to confirm that user input doesn’t cause PDF injection. Consider validating any content going into PDFs to ensure that PDF code can’t be injected.
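As a minimal illustration of that advice, the Python sketch below escapes the three characters that matter for a literal string. The function name is hypothetical and this is not code from any of the libraries discussed; a real library also has to deal with character encoding.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    # Backslash must be handled first, otherwise the backslashes added
    # for the parentheses would themselves be doubled.
    return (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )


escaped = escape_pdf_string("/blah)")
print("/URI (" + escaped + ")")  # /URI (/blah\))
```

With the closing parenthesis escaped, the input stays inside the string and the generator's own parenthesis is the only one that terminates it.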
- Vulnerable libraries can make user input inside PDFs dangerous by not escaping parentheses and backslashes.
- A clear objective helps when tackling seemingly impossible problems and persistence pays off when trying to achieve those goals.
- One simple link can compromise the entire contents of an unknown PDF.
You can download all the injection examples in this whitepaper at:
I knew nothing about the structure of PDFs until I watched a talk about building your own PDF manually by Ange Albertini. He is a great inspiration to me and without his learning materials this post would never have been made. I’d also like to credit Alex «InsertScript» Inführ, who covered PDFs in his mess with the web presentation. It blew everyone’s mind when he demonstrated how much a PDF was able to do. Thank you to both of you. I’d also like to thank Ben Sadeghipour & Cody Brocious for the idea of performing an SSRF attack from a PDF in their excellent presentation.