There are only three distinct parts in a PDF file after the version number: a series of numerically labeled objects (in any order), the cross reference table associating byte offsets with each labeled object, and the trailer dictionary declaring the primary labeled object and document info object.
The very beginning is the minimum version number of the PDF specification required for the document.
It is recommended that the second line be a comment with at least four binary characters (character code 128 or greater) when the PDF has binary data, as an early indicator for file transfer programs to transfer the file as binary rather than text. Generally, compressed data is binary data. In emacs the command insert-char provides the means for inserting special characters.
%PDF-1.4
%\200\200\200\200
The cross reference table begins with the keyword xref and then two numbers: the number of the first indirect object in the list and the total number of entries listed.
Indirect objects are labeled as contiguous numbers starting from 1. The cross reference table is a list referencing each indirect object in sequential order beginning from the non-existent 0 object. Therefore, object 1 is the second line, object 2 is the third line, and so on.
Unlike the rest of the PDF file, the listing is strictly formatted with each entry exactly 20 bytes of three space separated values:
10-digit zero-prefixed number. When an indirect object is "in use" this is the byte offset from beginning of file, t.i. before "%PDF-...". When freed this is instead the number of the next freed object.
5-digit zero-prefixed number. The generation of the indirect object. Starts as 0 and increased by one when freed. Always 65535 for the 0 object, which also means to never reuse the object.
the letter n for "in use" or f for "freed"
The last two bytes are the end-of-line (EOL) characters (typically: Control-m and Control-j), or a space then a new line (Control-j).
xref
0 ??
0000000000 65535 f
On the other hand, PDF readers seem to ignore incorrect byte offsets in the cross reference table and have been able to display the document anyway.
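The 20-byte entry format described above can be sketched in Python. This is a minimal illustration, not a reference implementation; the function name xref_entry and its parameters are assumptions made for the example, and it uses the "space then line feed" form of the two-byte EOL.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Return one cross reference entry of exactly 20 bytes."""
    flag = b"n" if in_use else b"f"
    # 10-digit offset, space, 5-digit generation, space, flag,
    # then a two-byte end of line (space + line feed).
    return b"%010d %05d %s \n" % (offset, generation, flag)


# The entry for the non-existent object 0, then an in-use object at byte 15.
print(xref_entry(0, 65535, in_use=False))
print(xref_entry(15))
```

Every entry has the same length, which is what lets a reader seek directly to the entry for any object number.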
The file ends with the keyword trailer and a dictionary, then the keyword startxref and the byte offset of the last cross reference table, and finally the end of file marker.
trailer
<< /Size ?? /Info ?? 0 R /Root ?? 0 R >>
startxref
??
%%EOF
Whitespace is the same as in HTML with the addition of the null character, t.i. space, horizontal tab, line feed (newline), form feed, carriage return, and null. Also known as space, ^I, ^J, ^L, ^M, ^@. In emacs, using Control-q as needed: Space, Control-q Control-i, Control-q Control-j, Control-q Control-l, Control-q Control-m, Control-q Control-@. Similar to HTML, whitespace characters are typically interchangeable and a run of them counts the same as one, but in contrast they are significant within strings and streams.
The end of line (EOL) is sometimes significant, such as when ending a stream. The EOL can be a carriage return followed by a line feed, however, it can also simply be a line feed.
The delimiters are parentheses, curly braces, square brackets, angle brackets, forward slash, and percent sign: (, ), {, }, [, ], <, >, /, %. Of course, whitespace is unnecessary around delimiters, though conventionally added.
A series of characters other than delimiters or whitespace is referred to as a token within the documentation.
Overall, there are supposedly eight types of objects for PDF: booleans, numbers (integers and real numbers), strings, names, arrays, dictionaries, streams, and null.
Names begin with a forward slash / followed by any character other than delimiters and whitespace, and are case-sensitive.
Boolean values are the lowercase keywords true and false. Integers and real numbers differ only by the latter having a period, and integers can be used in place of real numbers. The null object is simply the keyword null.
A string of text is encapsulated within parentheses ( ) rather than quotes, and includes newlines (unless the newline is escaped with a backslash). Paired parentheses are okay within a string, however an unpaired parenthesis must be escaped with a backslash, t.i. \( or \). A string of hexadecimal data is encapsulated within single angle brackets < >.
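As a rough sketch of these two string forms in Python: the helper names pdf_literal_string and pdf_hex_string are hypothetical, and the literal version escapes every parenthesis, which is always valid even though balanced pairs could be left alone. It also escapes the backslash itself so it is not read as the start of an escape.

```python
def pdf_literal_string(text: str) -> str:
    """Wrap text in parentheses, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")  # escape the escape character first
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "(" + escaped + ")"


def pdf_hex_string(data: bytes) -> str:
    """Wrap bytes as hexadecimal digits in single angle brackets."""
    return "<" + data.hex() + ">"


print(pdf_literal_string("An unpaired ( parenthesis"))
print(pdf_hex_string(b"PDF"))
```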
Arrays are within square brackets [ ] as whitespace separated values.
Dictionaries are within double angle brackets << >> as whitespace separated pairs of names and values.
A stream object is a dictionary object followed by stream-data encapsulated by newlines (required type of whitespace) and the keywords stream and endstream. Therefore, a stream is always referenced because it needs to be in an indirect object in order to group its parts.
In other words, a stream object is always within an indirect object:
1 0 obj
<<dictionary>>stream
NEWLINEstream-dataNEWLINEendstream
endobj
The dictionary object of a stream must specify the length of the stream-data, and optionally the stream-data can be encoded.
Flate compressed data can be decompressed with pigz -d -. The -d is for decompressing, and the - is for accepting standard input. In emacs, the command shell-command-on-region can be used on selected text and results are sent to the default output buffer. The default shortcut is M-| (Meta-pipe), and using the prefix arg (C-u) means replace the selected text with the results.
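Outside of emacs, the same decompression can be sketched with Python's standard zlib module, since Flate stream-data is in the zlib format. The function name inflate is an assumption for this example, and the sample content stream is made up.

```python
import zlib


def inflate(stream_data: bytes) -> bytes:
    """Decompress the stream-data of a /FlateDecode stream."""
    return zlib.decompress(stream_data)


# Round trip with made-up content standing in for real stream-data.
sample = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
print(inflate(sample))
```

The bytes passed in must be exactly the stream-data between the keywords stream and endstream, without the surrounding newlines.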
An object becomes an indirect object by encapsulating it as: a label (a number), a generation number, the keyword obj; then the object; and then the keyword endobj.
An indirect reference is to an indirect object: label, generation number, keyword R, f.e. 1 0 R. It is a single value though it is whitespace separated, so it looks like it's three values when within an array or dictionary, but is only one. Take note of the keyword R.
Some dictionary keys require an indirect reference, or allow an array of them.
An indirect object exchanges the repetition of its object for the repetition of its reference. File size can be reduced only when the reference is shorter than the object and the object is used at least twice. This implies numbers as indirect objects are unlikely to ever reduce file size, and booleans or null never.
Indirect objects typically have a generation number of "0". Therefore, the length of a reference is the length of the label of the indirect object plus 4 for its ending of " 0 R". Indirect objects labeled with 1–9 have references five characters long, labeled 10–99 have references six characters long, and so on.
The size of an indirect object is the sum of the overhead of encapsulating the object plus its object's length.
In addition to the length of the object (O), the size of an indirect object (I) is the sum of the length of its label plus 7 characters for the typical generation number and beginning " 0 obj", and 8 characters for the ending " endobj", when conventionally including whitespace.
Therefore the overhead for an indirect object labeled 1–9 is 16 characters and the length of the indirect object (I) is "16 + O". When labeled 10–99 the overhead is 17 characters and "I = 17 + O", and so on.
In order to reduce the file size when using an indirect object a specific number of times (n), the characters saved by replacing the object (O) with its reference (R) at each use must add up to more than the length of the indirect object (I): n(O - R) > I. When equal, the file size remains the same.
For indirect objects labeled 1–9 (I=16+O and R=5) and used twice (n=2):
2(O - 5) > (16 + O), O > 26 characters.
When labeled 10–99: 2(O - 6) > (17 + O), O > 29 characters. For labels 100–999: 2(O - 7) > (18 + O), O > 32 characters.
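The same arithmetic can be checked with a short Python sketch. The function name and parameters are made up for illustration, and it assumes a generation number of 0 and the conventional whitespace counted above.

```python
def indirect_saves_space(object_length: int, label: int, uses: int = 2) -> bool:
    """Test n(O - R) > I for an object of length O used n times."""
    digits = len(str(label))
    reference_length = digits + 4                   # R: label plus " 0 R"
    indirect_length = digits + 7 + 8 + object_length  # I: overhead plus O
    return uses * (object_length - reference_length) > indirect_length


# Break-even points for two uses with labels 1-9, 10-99, and 100-999.
for label in (1, 10, 100):
    shortest = next(n for n in range(1, 1000) if indirect_saves_space(n, label))
    print(label, shortest)
```

The shortest object that saves space is one character longer than each threshold above: 27, 30, and 33 characters.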
Content streams have potential for reuse by means of being listed within an array for a /Contents key of a Page indirect object.
In addition to the overhead of an indirect object, the size of a stream includes its dictionary and its stream data. The stream dictionary is minimally 12 characters, <</Length >>, for non-filtered stream-data plus the length of the number (L) for the length of the stream-data: "12 + L". The size of the encapsulated stream-data is its length (O) plus 8 characters for stream^J before and 11 characters for ^Jendstream after, when conventionally encapsulated by whitespace: "19 + O".
Therefore, the size of an indirect stream object (Is) is the size of an indirect object (I) plus "31 + L": Is = I + 31 + L.
In contrast to a single stream, a whitespace character is required for separating each additional reference in an array of content streams: [1 0 R 2 0 R 3 0 R]. In that case, the reference length for other streams is one more character than usual: R=6 instead of 5 when labeled 1–9, R=7 instead of 6 when labeled 10–99, and so on.
For indirect stream objects labeled 1–9 (I=16+O and R=6) used twice (n=2): 2(O - R) > Is, 2(O - R) > (I + 31 + L), 2(O - 6) > (47 + O + L), O > 59 + L. Because L is the number of digits in the length, a two-digit length gives L=2. Therefore the length of the stream data (O) must be greater than 61 characters in order for the file size to decrease. Otherwise, it takes more space to have the indirect stream object for reuse in an array of content streams than to just combine them into one content stream.
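A small Python sketch of the stream case, under the same counting assumptions as the text (generation 0, unfiltered data, conventional whitespace). The function name and parameters are hypothetical.

```python
def stream_reuse_saves_space(data_length: int, label: int, uses: int = 2) -> bool:
    """Test n(O - R) > Is for stream-data of length O in a /Contents array."""
    digits = len(str(label))
    reference_length = digits + 4 + 1              # R: reference plus separator
    length_digits = len(str(data_length))          # L: digits of the /Length value
    indirect_length = digits + 7 + 8 + data_length  # I = overhead + O
    stream_length = indirect_length + 31 + length_digits  # Is = I + 31 + L
    return uses * (data_length - reference_length) > stream_length


print(stream_reuse_saves_space(61, label=1))  # at the threshold: no saving
print(stream_reuse_saves_space(62, label=1))  # one character more: saves space
```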
Comments begin with a percent sign % and continue until the end of the line, but only outside strings or streams.
Dates are strings of text with numbers for each unit: (D:YYYYMMDDHHmmSSOHH'mm'). The "O" is the relationship of local time to UTC: - when earlier (minus), + when later (plus), or Z when the same as UTC. It is followed by the offset from UTC as "HH'mm'".
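One way to produce such a date string is sketched below in Python. The function name pdf_date is an assumption, it requires a timezone-aware datetime, and for UTC it writes Z00'00' to match the form used elsewhere in these notes.

```python
from datetime import datetime, timedelta, timezone


def pdf_date(moment: datetime) -> str:
    """Format an aware datetime as a PDF date string."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("a timezone-aware datetime is required")
    minutes = int(offset.total_seconds()) // 60
    if minutes == 0:
        relation = "Z"
    else:
        relation = "+" if minutes > 0 else "-"
    hours, remainder = divmod(abs(minutes), 60)
    stamp = moment.strftime("D:%Y%m%d%H%M%S")
    return "(%s%s%02d'%02d')" % (stamp, relation, hours, remainder)


print(pdf_date(datetime(2018, 9, 17, 6, 0, 0, tzinfo=timezone.utc)))
print(pdf_date(datetime(2018, 9, 17, 6, 0, 0, tzinfo=timezone(timedelta(hours=-5)))))
```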
Rectangles are an array of two coordinates, the lower left corner and the upper right corner: [llx lly urx ury].
A destination value can be directly specified with an array of the indirect reference to the /Page object followed by a name indicating how to use the values for a location (coordinates) on that page and the magnification (zoom) level.
A null value for a parameter retains that parameter unchanged.
A name or string can also be used as a reference to a destination value, which is required for linking to external documents for lack of an indirect reference for the array.
In order to use a name for a destination, the targeted document needs its /Catalog object to have a /Dests key with a dictionary of names to destinations. In order to use a string for a destination, the targeted document needs its /Catalog object to have a /Names key with a dictionary of dictionaries for naming indirect objects, one of which can be a /Dests subdictionary with an indirect reference to a name tree (3.8.4 Name Trees, p.101) associating names to destinations.
An indirect object conventionally starts at the beginning of a line with three whitespace separated values: a number as its label, a generation number, and the keyword obj. The object begins on the next line for as many lines as needed. The last line for an indirect object has only the keyword endobj. However, newlines are only conventional; any whitespace will suffice.
Contents of indirect objects that are or begin with a dictionary typically have a key in the dictionary declaring its /Type.
Page numbers are from the PDF reference, third edition (version 1.4).
Referenced from the document trailer dictionary key /Info. Values are text string objects or indirect references thereof.
See section 3.6.1 Document Catalog, page 83.
A document outline is recorded as a dictionary of outline item dictionaries.
An outline item dictionary is just a basic dictionary, t.i. without a /Type key.
See section 3.6.2 Page Tree, page 86.
/Kids is an array [ ] of indirect references to /Page or /Pages objects.
Additionally, a /Pages object can have the keys of a /Page dictionary that are intended to be inherited by its pages, such as /MediaBox and /Resources.
See section 3.6.2 Page Tree, bottom of page 87.
Some entries can be inherited by omitting them in the /Page object and specifying them in the /Parent object, which is a /Pages object.
A content stream is simply a labeled object for a stream that has a set of instructions as a series of operands and operators. The /Contents can be an array of streams split between tokens and is concatenated into a single stream. This potentially allows for reuse of common parts.
The operands for an operator are listed immediately before it. An operand is any direct object (no indirect references) other than streams, or a named resource within a resource dictionary of /Resources. Operators are specific keywords that have meaning only within content streams, described in chapters 4, 5, 6, and 9 of the PDF Reference. Similarly, named resources are known only from the specific resource dictionary of /Resources associated with the same /Page.
Saving and restoring the graphics state with q and Q must be balanced within the /Contents as a whole for the /Page, if used at all.
A resource dictionary is a dictionary of specific named dictionaries, each of which is for associating names with any labeled objects for use as operands in the content streams of /Contents of its /Page. This is the only way those content streams can use indirect objects as operands for the operators. Each subdictionary is named for specific types of objects:
See section 9.4 Page-Piece Dictionaries, page 581.
The /PieceInfo dictionary can contain many dictionaries, each keyed by the name of a distinct application, f.e. /emacs, or a "well-known" data type. Each sub-dictionary has two keys:
Therefore, the info is in a sub-sub-dictionary.
- /emacs
  - /LastModified
    - (D:20180917060000Z00'00')
  - /Private
    - /whatever
      - (Some text.)
    - /something
      - (More info.)
An image object is a dictionary with information about the image followed by the stream data for the image. The stream of data for a JPEG image is simply the exact same contents of the JPEG file, and its color space might be embedded (see B.4 on p.85 of the ICC1v43_2010-12.pdf).
In addition to the keys for a stream dictionary:
true
1, 2, 4, or 8. REQUIRED, but optional for image masks (and then it's 1). Always 8 when stream /Filter is /DCTDecode (JPEG).
false
; approves attempts for smoothing small images (few pixels) across a large space (many pixels)
An image is in "image space", which means it is one unit wide by one unit high. Therefore, a transformation matrix for scaling (using width 0 0 height 0 0 with cm in a content stream) is necessary in order for it to be large enough to be visible. Making the width and height of the painted image (from Do in a content stream) the same as the /MediaBox for a /Page will paint the page completely with the image.
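The scaling and painting instructions can be generated with a short Python sketch. The function names and the resource name Im1 are hypothetical; the name must match whatever the /Resources of the /Page actually declares for the image. The q and Q pair keeps the scaling from affecting later instructions.

```python
def number(value: float) -> str:
    """Format a number without exponent notation or trailing zeros."""
    return ("%.4f" % value).rstrip("0").rstrip(".")


def paint_image(name: str, width: float, height: float) -> str:
    """Content stream instructions that scale image space and paint the image."""
    matrix = "%s 0 0 %s 0 0" % (number(width), number(height))
    return "q %s cm /%s Do Q" % (matrix, name)


# Fill a page whose /MediaBox is [0 0 612 792] with the image named Im1.
print(paint_image("Im1", 612, 792))
```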
A color space is defined by an array of the color space family name followed by its parameters, if any. Some color space families are without parameters and therefore those can be specified as simply their names instead of an array.
A basic foundation of objects for a PDF document. Notably, they are indirect dictionary objects and all are ultimately discovered from the trailer dictionary (a direct object).
Essentially, all that remains is to create some /Page objects for the /Kids array of the main /Pages dictionary, and optionally some outline item dictionaries for the /Outlines dictionary. Technically, a "page" can be whatever size, therefore there could be just one /Page object with everything on it, infinitely scrollable like an HTML document.
Afterwards, create the cross reference table listing the byte offsets for all the indirect objects, set the /Size key of the trailer dictionary, and set the byte offset for startxref.
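As an alternative to doing this by hand, the bookkeeping can be sketched in Python. This is only an illustration under simple assumptions: objects are labeled 1, 2, 3 and so on in list order, every generation number is 0, and the names build_pdf, objects, root, and info are made up for the example.

```python
def build_pdf(objects, root=1, info=None):
    """Assemble header, indirect objects, cross reference table, and trailer."""
    out = bytearray(b"%PDF-1.4\n%\x80\x80\x80\x80\n")
    offsets = []
    for label, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object's label
        out += b"%d 0 obj\n" % label + body + b"\nendobj\n"
    xref_offset = len(out)
    size = len(objects) + 1  # the entries plus the non-existent object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    trailer = b"<< /Size %d /Root %d 0 R" % (size, root)
    if info is not None:
        trailer += b" /Info %d 0 R" % info
    out += b"trailer\n" + trailer + b" >>\n"
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


document = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
])
print(document.decode("latin-1"))
```

Because the offsets are measured while the bytes are appended, they stay correct whenever an object body changes.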
Keyboard macros used in emacs (aka Editing MACroS).
Keyboard macro for updating the byte offsets in the cross reference table.
Requires register "8" to have the point at the first line of the table in the cross reference table, which is object 0. Also note that ESC is normally represented as the META modifier key, abbreviated to M-, in macros.
Start with the cursor at the point after the keyword endobj, which is conventionally the end of a line. This is also one character before the label of the next object on the next line. The value of the command `point' will be the byte offset from the beginning of the file for the next object because it is the position of the character at that point. (Notice the `point' command at the beginning of the file returns the value "1", not "0", therefore it's the position and the cursor must be positioned one character before.)
;Store the point value of the position before the object
;into a register.
M-:			;; eval-expression
M-(			;; ()
set-register		;; (set-register)
SPC			;; (set-register )
?h			;; (set-register ?h)
M-(			;; (set-register ?h())
point			;; (set-register ?h(point))
RET
;Get the label for the object, which is a number, from the
;next line.
;This will be used to skip forward that number of lines
;from the beginning of the cross reference table.
C-f			;; forward-char
C-x r n			;; number-to-register
n
;Move to the next object, which would be at the end of this
;one if there is another.
C-s			;; isearch-forward
endobj
RET
;Store point to a register for returning the cursor to this
;position when done for beginning the macro again.
C-x r SPC		;; point-to-register
9
;Jump to the point of the beginning of the first line of the
;table in the cross reference table, which is object number 0.
;Remember: this has to be set before using this macro.
C-x r j			;; jump-to-register
8
;The label of the object is also the number of its line in
;the cross reference table.
M-:			;; eval-expression
M-(			;; ()
forward-line		;; (forward-line)
SPC			;; (forward-line )
C-x r i			;; insert-register
n			;; This will be the label stored earlier.
RET
;Select the byte offset and delete.
M-F			;; forward-word, and select
DEL			;; delete-backward-char
;Insert the new 10-digit byte offset prefixed with zeros.
M-:			;; eval-expression
M-(			;; ()
insert			;; (insert)
M-(			;; (insert())
format"%010d"		;; (insert(format"%010d"))
C-x r i			;; insert-register
h			;; The value of point before the object.
RET
;Jump back to the end of this object in preparation for
;using this macro again, on the next object.
C-x r j			;; jump-to-register
9