PDF notes


# basic layout of a PDF document

A PDF file has only three distinct parts after the version number: a series of numerically labeled objects (in any order), the cross reference table associating a byte offset with each labeled object, and the trailer dictionary declaring the primary labeled object and the document info object.

The file begins with the minimum version of the PDF specification required for the document.

It is recommended the second line is a comment with at least four binary characters (character code 128 or greater) when the PDF has binary data, as an early indicator for file transfer programs to know to transfer as binary rather than text. Generally, compressed data is binary data. In emacs the command insert-char provides the means for inserting special characters.

%PDF-1.4
%\200\200\200\200
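
The two header lines above can be produced programmatically. The following is a minimal sketch in Python; the function name `pdf_header` is hypothetical, and it assumes a single line feed as the end of line.

```python
def pdf_header(version="1.4"):
    """Return the first two lines of a PDF file as bytes."""
    # Four bytes with character code 128 (octal 200) mark the file as binary.
    binary_marker = bytes([0o200] * 4)
    return b"%PDF-" + version.encode("ascii") + b"\n%" + binary_marker + b"\n"


header = pdf_header()
print(len(header))  # 9 bytes for the version line plus 6 for the comment line
```

Any four bytes of 128 or greater would serve the same purpose; octal 200 simply matches the sample above.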

# cross reference table

The cross reference table begins with the keyword xref and then two numbers: the number of the first indirect object in the list, the total number listed.

Indirect objects are labeled as contiguous numbers starting from 1. The cross reference table is a list referencing each indirect object in sequential order beginning from the non-existent 0 object. Therefore, object 1 is the second line, object 2 is the third line, and so on.

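Unlike the rest of the PDF file, the listing is strictly formatted, with each entry exactly 20 bytes long: three space separated values followed by a two-byte end of line.

a 10-digit byte offset of the object, padded with leading zeros
a 5-digit generation number, padded with leading zeros
the keyword n for an object in use, or f for a free entry

The last two bytes are the end-of-line (EOL) characters: typically a carriage return and a line feed (Controlm and Controlj), or a space followed by a line feed (Controlj).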

xref
0 ??
0000000000 65535 f 
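
The fixed 20-byte layout is easy to get wrong by hand, so here is a small sketch of formatting the entries in Python. The names `xref_entry` and `xref_table` are hypothetical, and the sketch assumes the space-plus-line-feed form of the EOL and objects labeled contiguously from 1.

```python
def xref_entry(offset, generation=0, in_use=True):
    """Format one cross reference entry: exactly 20 bytes."""
    keyword = b"n" if in_use else b"f"
    # 10-digit offset, space, 5-digit generation, space, keyword, space, LF
    return b"%010d %05d %s \n" % (offset, generation, keyword)


def xref_table(offsets):
    """Build a table for objects 1..N from a list of their byte offsets."""
    lines = [b"xref\n", b"0 %d\n" % (len(offsets) + 1)]
    lines.append(xref_entry(0, 65535, in_use=False))  # the non-existent 0 object
    lines.extend(xref_entry(offset) for offset in offsets)
    return b"".join(lines)


print(xref_table([15, 64]).decode("ascii"))
```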

On the other hand, PDF readers seem to ignore incorrect byte offsets in the cross reference table and have been able to display the document anyway.

# ending

The file ends with the keyword trailer and a dictionary, then the keyword startxref and the byte offset of the last cross reference table, and finally the end of file marker %%EOF.

# trailer dictionary

/Size
(REQUIRED) total number of indirect objects, including the non-existent 0 object
/Prev
byte offset of the previous cross reference table, when more than one
/Root
(REQUIRED) indirect object reference to the /Catalog object
/Info
indirect object reference to the document info dictionary
/ID
an array of two unique file ids, both the same when the document has yet to be updated

trailer
<< /Size ?? /Info ?? 0 R /Root ?? 0 R >>
startxref
??
%%EOF
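
Because the ending is read first, a reader typically starts from the tail of the file. The sketch below shows one way to recover the startxref value in Python; `find_startxref` is a hypothetical helper, and searching only the last 1024 bytes is an assumption of this sketch rather than something stated in these notes.

```python
import re


def find_startxref(data):
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-1024:]  # assumption: the ending fits in the last 1024 bytes
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


sample = b"trailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
print(find_startxref(sample))
```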

# syntax of PDF

# whitespace, delimiters, tokens

Whitespace is the same as in HTML with the addition of the null character, that is: space, horizontal tab, line feed (newline), form feed, carriage return, and null. Also known as space, ^I, ^J, ^L, ^M, ^@. In emacs, using Controlq as needed: space, Controlq Controli, Controlq Controlj, Controlq Controll, Controlq Controlm, Controlq Control@. As in HTML, whitespace characters are interchangeable and any run of them counts as one, but in contrast they are significant within strings and streams.

The end of line (EOL) is sometimes significant, such as when ending a stream. The EOL can be a carriage return followed by a line feed, however, it can also simply be a line feed.

The delimiters are parentheses, curly braces, square brackets, angle brackets, forward slash, and percent sign: (, ), {, }, [, ], <, >, /, %. Whitespace is unnecessary around delimiters, though it is conventionally added.

A series of characters other than delimiters or whitespace is referred to as a token within the documentation.
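
To make the roles of whitespace and delimiters concrete, here is a rough sketch of splitting PDF syntax into tokens and delimiters in Python. The name `split_tokens` is hypothetical, and the sketch deliberately ignores strings, comments, and streams, where whitespace and delimiters lose their special meaning.

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def split_tokens(data):
    """Split bytes into tokens, emitting each delimiter as its own item."""
    tokens = []
    current = bytearray()
    for byte in data:
        if byte in WHITESPACE or byte in DELIMITERS:
            if current:  # a whitespace or delimiter ends the running token
                tokens.append(bytes(current))
                current.clear()
            if byte in DELIMITERS:
                tokens.append(bytes([byte]))
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens


print(split_tokens(b"[1 0 R/Name]"))
```

Note that no whitespace was needed between R and /Name, because the forward slash is itself a delimiter.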

# objects

Overall, there are eight types of objects for PDF: booleans, numbers (integers and real numbers), strings, names, arrays, dictionaries, streams, and null.

Names begin with a forward slash / followed by any character other than delimiters and whitespace, and are case-sensitive.

Boolean values are the lowercase keywords true and false. Integers and real numbers differ only by the latter having a period, and integers can be used in place of real numbers. The null object is simply the keyword null.

A string of text is encapsulated within parentheses ( ) rather than quotes, and includes newlines (unless the newline is preceded by a backslash). Paired parentheses are okay within a string, however an unpaired parenthesis must be escaped with a backslash, that is \( or \). A string of hexadecimal data is encapsulated within single angle brackets < >.

Arrays are within square brackets [ ] as whitespace separated values.

Dictionaries are within double angle brackets << >> as whitespace separated pairs of names and values.
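
The object syntax described above maps fairly directly onto Python values. The following sketch serializes them; the `Name` class and `to_pdf` function are hypothetical, streams and hexadecimal strings are left out, and real numbers are written with at most four decimal places as an arbitrary choice of this sketch.

```python
class Name(str):
    """Marks a Python string that should be written as a PDF name."""


def to_pdf(value):
    """Serialize a Python value using PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first, since bool is also an int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # Fixed-point only; exponent notation is avoided in this sketch.
        return f"{value:.4f}".rstrip("0").rstrip(".")
    if isinstance(value, Name):
        return "/" + value
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(item) for item in value) + "]"
    if isinstance(value, dict):
        pairs = " ".join(f"/{key} {to_pdf(item)}" for key, item in value.items())
        return "<< " + pairs + " >>"
    raise TypeError(f"no PDF syntax for {type(value).__name__}")


print(to_pdf({"Type": Name("Page"), "MediaBox": [0, 0, 612.0, 792.5]}))
```

Escaping every parenthesis is more than the syntax requires, since paired parentheses are allowed, but it is always safe.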

# streams

A stream object is a dictionary object followed by stream-data encapsulated by newlines (required type of whitespace) and the keywords stream and endstream. Therefore, a stream is always referenced because it needs to be in an indirect object in order to group its parts.

In other words, a stream object is always within an indirect object:

1 0 obj
<<dictionary>>streamNEWLINEstream-dataNEWLINEendstream
endobj

The dictionary object of a stream must specify the length of the stream-data, and optionally the stream-data can be encoded.

/Length
(REQUIRED) number of bytes of the stream-data
/Filter
a name of a filter, or an array of names in the order to be applied; f.e. /FlateDecode, /DCTDecode
/DecodeParms
a dictionary of parameters for the filter, or an array of dictionaries for the filters

Flate compressed data can be decompressed with pigz -d -. The -d is for decompressing, and the - is for accepting standard input. In emacs, the command shell-command-on-region can be used on selected text and results are sent to the default output buffer. The default shortcut is M-| (Meta-pipe), and using the prefix arg (C-u) means replace the selected text with the results.
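
As an alternative to an external program, Python's standard zlib module handles the same Flate data. The sketch below builds and reads back a filtered stream; `make_flate_stream` and `read_flate_stream` are hypothetical names, and locating the data by searching for the keywords (rather than honoring /Length) is a shortcut taken only for brevity.

```python
import zlib


def make_flate_stream(data):
    """Return a stream object (dictionary plus data) with /FlateDecode applied."""
    compressed = zlib.compress(data)
    dictionary = b"<</Length %d/Filter/FlateDecode>>" % len(compressed)
    return dictionary + b"stream\n" + compressed + b"\nendstream"


def read_flate_stream(stream_object):
    """Decompress the stream-data found between stream and endstream."""
    start = stream_object.index(b"stream\n") + len(b"stream\n")
    end = stream_object.rindex(b"\nendstream")
    return zlib.decompress(stream_object[start:end])


content = b"q 612 0 0 792 0 0 cm /Im0 Do Q"
stream = make_flate_stream(content)
print(read_flate_stream(stream) == content)
```

For very short data such as this, the compressed form may well be longer than the original.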

# indirect objects

An object can be a type of indirect object by encapsulating it as: a label (a number), a generation number, the keyword obj; then the object; and then the keyword endobj.

An indirect reference is to an indirect object: label, generation number, keyword R, f.e. 1 0 R. It is a single value though it is whitespace separated, so it looks like it's three values when within an array or dictionary, but is only one. Take note of the keyword R.

Some dictionary keys require an indirect reference, or allow an array of them.

# it goes without saying…

An indirect object exchanges the repetition of its object for the repetition of its reference. File size can be reduced only when the reference is shorter than the object, and the object is used at least twice. This implies numbers as indirect objects are unlikely to ever reduce file size, and booleans or null never.

Indirect objects typically have a generation number of "0". Therefore, the length of a reference is the length of the label of the indirect object plus 4 for its ending of 0 R. Indirect objects labeled with 1–9 have references five characters long, labeled 10–99 have references six characters long, and so on.

The size of an indirect object is the sum of the overhead of encapsulating the object plus its object's length.

# overhead of indirect objects

In addition to the length of the object (O), the size of an indirect object (I) is the sum of the length of its label plus 7 characters for the typical generation number and beginning " 0 obj ", and 8 characters for the ending " endobj ", when conventionally including whitespace.

Therefore the overhead for an indirect object labeled 1–9 is 16 characters and the length of the indirect object (I) is "16 + O". When labeled 10–99 the overhead is 17 characters and "I = 17 + O", and so on.

In order to reduce the file size when using an indirect object a specific number of times (n), the length of the indirect object (I) must be less than that multiple of the difference of its object's length (O) and the length of its reference (R): n(O - R) > I. When equal, the file size remains the same.

For indirect objects labeled 1–9 (I=16+O and R=5) and used twice (n=2):
2(O - 5) > (16 + O), O > 26 characters.

When labeled 10–99: 2(O - 6) > (17 + O), O > 29 characters. For labels 100–999: 2(O - 7) > (18 + O), O > 32 characters.
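
The inequality n(O - R) > I can be checked mechanically. This sketch uses the same character counts as above (generation number 0 and conventional whitespace); the function names are hypothetical.

```python
def reference_length(label):
    """Length of a reference such as '1 0 R': the label plus ' 0 R'."""
    return len(str(label)) + 4


def indirect_size(object_length, label):
    """Label, ' 0 obj ' (7), the object itself, and ' endobj ' (8)."""
    return len(str(label)) + 7 + object_length + 8


def saves_space(object_length, label, uses):
    """True when n(O - R) > I, so the indirect object shrinks the file."""
    saved = uses * (object_length - reference_length(label))
    return saved > indirect_size(object_length, label)


def smallest_saving_length(label, uses=2):
    """Smallest object length O for which the indirect object pays off."""
    if uses < 2:
        raise ValueError("an object used once can never save space")
    length = 1
    while not saves_space(length, label, uses):
        length += 1
    return length


for label in (1, 10, 100):
    print(label, smallest_saving_length(label))
```

The results are one more than the thresholds 26, 29, and 32 derived above, because the inequality is strict.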

# reuse of streams for reduction

Content streams have potential for reuse by means of being listed within an array for a /Contents key of a Page indirect object.

In addition to the overhead of an indirect object, the size of a stream includes its dictionary and its stream data. The stream dictionary is minimally 12 characters, <</Length >>, for non-filtered stream-data, plus the length of the number (L) giving the length of the stream-data: "12 + L". The size of the encapsulated stream-data is its length (O) plus 8 characters for ^Jstream^J before and 11 characters for ^Jendstream^J after, counting the whitespace conventionally placed on both sides of each keyword: "19 + O".

Therefore, the size of an indirect stream object (Is) is the size of an indirect object (I) plus "31 + L": Is = I + 31 + L.

In contrast to a single stream, a whitespace character is required for separating each additional reference in an array of content streams: [1 0 R 2 0 R 3 0 R]. In that case, the reference length for other streams is one more character than usual: R=6 instead of 5 when labeled 1–9, R=7 instead of 6 when labeled 10–99, and so on.

For indirect stream objects labeled 1–9 (I=16+O and R=6) used twice (n=2): 2(O - R) > Is, 2(O - R) > (I + 31 + L), 2(O - 6) > (47 + O + L), O > 59 + L. Since L is the number of digits in O, and any O from 60 through 99 has two digits (L=2), the length of the stream data (O) must be greater than 61 characters in order for the file size to decrease. Otherwise, it takes more space to have the indirect stream object for reuse in an array of content streams than to just combine them into one content stream.
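
The same check for streams, as a sketch under the counting conventions of these notes: 31 characters of stream overhead, L for the digits of the length, and one extra separator character per reference inside the /Contents array. The function names are hypothetical.

```python
def stream_object_size(data_length, label):
    """Is = I + 31 + L, where I = label digits + 15 + O."""
    digits = len(str(data_length))  # L, the digits of the /Length value
    return len(str(label)) + 15 + data_length + 31 + digits


def smallest_reusable_stream(label, uses=2):
    """Smallest stream-data length O for which reuse reduces file size."""
    if uses < 2:
        raise ValueError("a stream used once can never save space")
    reference = len(str(label)) + 4 + 1  # ' 0 R' plus a separating space
    length = 1
    while uses * (length - reference) <= stream_object_size(length, label):
        length += 1
    return length


print(smallest_reusable_stream(1))   # labels 1-9
print(smallest_reusable_stream(10))  # labels 10-99
```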

# comments

Comments begin with a percent sign % and continue until the end of the line, but only outside strings or streams.

# common data structures

Dates are strings of text with numbers for each unit: (D:YYYYMMDDHHmmSSOHH'mm'). The "O" is the relationship of local time to UTC: -, +, or Z for earlier than, later than, or the same as UTC, and is followed by the offset "HH'mm'" in hours and minutes.
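
A sketch of producing such a date string from a Python datetime follows. The name `pdf_date` is hypothetical, and rejecting datetimes without a time zone is a choice of this sketch so that the offset is always written.

```python
from datetime import datetime, timedelta, timezone


def pdf_date(moment):
    """Format an aware datetime as a PDF date string, parentheses included."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("a datetime with a time zone is required")
    stamp = moment.strftime("%Y%m%d%H%M%S")
    minutes = int(offset.total_seconds()) // 60
    if minutes == 0:
        return f"(D:{stamp}Z00'00')"
    sign = "+" if minutes > 0 else "-"
    hours, remainder = divmod(abs(minutes), 60)
    return f"(D:{stamp}{sign}{hours:02d}'{remainder:02d}')"


print(pdf_date(datetime(2018, 9, 17, 6, 0, 0, tzinfo=timezone.utc)))
```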

Rectangles are an array of two coordinates, the lower left corner and the upper right corner: [llx lly urx ury].

# destinations

A destination value can be directly specified with an array of the indirect reference to the /Page object followed by a name indicating how to use the values for a location (coordinates) on that page and the magnification (zoom) level.

[page /XYZ left top zoom]
position coordinates of page at top-left corner of view and magnify; null retains a parameter unchanged
[page /Fit]
fit page both horizontally and vertically within the view
[page /FitH top]
position top coordinate of page at top edge of view and fit page horizontally
[page /FitV left]
position left coordinate of page at left edge of view and fit page vertically
[page /FitR left bottom right top]
fit the contents of a rectangle within the view
[page /FitB]
fit bounding box of page both horizontally and vertically within the view
[page /FitBH top]
position top coordinate of page at top edge of view and fit bounding box of page horizontally
[page /FitBV left]
position left coordinate of page at left edge of view and fit bounding box of page vertically

A name or string can also be used as a reference to a destination value, which is required for linking to external documents for lack of an indirect reference for the array.

In order to use a name for a destination, the targeted document needs its /Catalog object to have a /Dests key with a dictionary of names to destinations. In order to use a string for a destination, the targeted document needs its /Catalog object to have a /Names key with a dictionary of dictionaries for naming indirect objects, one of which can be a /Dests subdictionary with an indirect reference to a name tree (3.8.4 Name Trees, p.101) associating names to destinations.

# types of indirect objects

An indirect object conventionally starts at the beginning of a line with three whitespace separated values: a number as its label, a generation number, and the keyword obj. The object begins on the next line for as many lines as needed. The last line for an indirect object has only the keyword endobj. However, newlines are only conventional; any whitespace will suffice.

Contents of indirect objects that are or begin with a dictionary typically have a key in the dictionary declaring its /Type.

Page numbers are from the PDF reference, third edition (version 1.4).

# document info dictionary

Referenced from the document trailer dictionary key /Info. Values are text string objects or indirect references thereof.

/Title
(1.1) title of document
/Author
name of person
/Subject
(1.1) subject of document
/Keywords
(1.1) keywords for the document
/Creator
name of application used to create original document before converting it to PDF
/Producer
name of application used for converting to PDF
/CreationDate
date and time of document creation
/ModDate
(1.1) date and time of most recent modification

# /Type /Catalog

See section 3.6.1 Document Catalog, page 83.

/Pages
(REQUIRED) indirect reference to a /Pages object
/PageMode
/UseNone, /UseOutlines, /UseThumbs, /FullScreen; how to initially display the document
/Outlines
indirect reference to an /Outlines object

# /Type /Outlines dictionary

A document outline is recorded as a dictionary of outline item dictionaries.

/First
(REQUIRED) indirect reference to its first outline item dictionary
/Last
(REQUIRED) indirect reference to its last outline item dictionary
/Count
number of outline items and all open suboutlines; REQUIRED when there are any outline items

# outline item dictionary

An outline item dictionary is just a basic dictionary, t.i. without a /Type key.

/Title
(REQUIRED) a text string for the label of the outline item
/Parent
(REQUIRED) indirect reference to the outline item dictionary containing this one, or to the outline dictionary
/Prev
indirect reference to previous outline item dictionary; REQUIRED unless this item is first
/Next
indirect reference to next outline item dictionary; REQUIRED unless this item is last
/First
indirect reference to first outline item dictionary of its suboutline; REQUIRED when it has a suboutline
/Last
indirect reference to last outline item dictionary of its suboutline; REQUIRED when it has a suboutline
/Count
positive number of all outline items in all suboutlines when this one is open, otherwise negative number; REQUIRED when it has a suboutline
/Dest
the destination for the outline item

# /Type /Pages

See section 3.6.2 Page Tree, page 86.

/Parent
(REQUIRED) indirect reference to the page tree node (another /Pages object) containing this one; however, unneeded for the root node
/Kids
(REQUIRED) an array [ ] of indirect references to /Page or /Pages objects.
/Count
(REQUIRED) number of /Page objects that are descendants of this node, at any depth

Additionally, can have the keys of /Page dictionary that are intended to be inherited by its pages, such as /MediaBox and /Resources.

# /Type /Page

See section 3.6.2 Page Tree, bottom of page 87.

Some entries can be inherited by omitting them in the /Page object and specifying them in the /Parent object, which is a /Pages object.

/Parent
(REQUIRED) indirect reference to the page tree node (a /Pages object) containing this one
/MediaBox
(REQUIRED, but inheritable) a rectangle defining the bounds of the content in default units (1/72 of an inch)
/Resources
(REQUIRED, but inheritable) a dictionary of resources for the /Contents; an empty dictionary when none, or omitted when inherited
/Contents
a content stream or an array of content streams; page is empty without it
/Rotate
number of degrees to rotate page, default 0; must be a multiple of 90
/PieceInfo
a page-piece dictionary
/LastModified
date and time, REQUIRED when /PieceInfo is set

# content streams of /Contents

A content stream is simply a labeled object for a stream that has a set of instructions as a series of operands and operators. The /Contents can be an array of streams split between tokens and is concatenated into a single stream. This potentially allows for reuse of common parts.

The operands for an operator are listed immediately before it. An operand is any direct object (no indirect references) other than streams, or a named resource within a resource dictionary of /Resources. Operators are specific keywords that have meaning only within content streams, described in chapters 4, 5, 6, and 9 of the PDF Reference. Similarly, named resources are known only from the specific resource dictionary of /Resources associated with the same /Page.

q
save the current graphics state
Q
restore the previous graphics state
ri
rendering intent: /AbsoluteColorimetric, /RelativeColorimetric, /Saturation, or /Perceptual; likely matches with /Intent of /Type /XObject /Subtype /Image
cm
concatenate a matrix, given as six numbers, to the current transformation matrix
Do
paint an image referenced by name from the resource dictionary of the /Page, such as /Type /XObject /Subtype /Image

Saving and restoring the graphics state with q and Q must be balanced within the /Contents as a whole for the /Page, if used at all.

# resource dictionary of /Resources

A resource dictionary is a dictionary of specific named dictionaries, each of which is for associating names with any labeled objects for use as operands in the content streams of /Contents of its /Page. This is the only way those content streams can use indirect objects as operands for the operators. Each subdictionary is named for specific types of objects:

/XObject
indirect references to dictionaries of "external objects" (/Type /XObject), such as images (/Subtype /Image)
/Font
indirect references to font dictionaries (/Type /Font)
/ProcSet
(obsolete since 1.4) an array of needed procedure set names: /PDF, /Text, /ImageB, /ImageC, /ImageI; 9.1 Procedure Sets, p. 574

# /PieceInfo dictionary

See section 9.4 Page-Piece Dictionaries, page 581.

The /PieceInfo dictionary can contain many dictionaries, each keyed by the name of a distinct application, f.e. /emacs, or a "well-known" data type. Each sub-dictionary has two keys:

/LastModified
date/time this page was modified by the associated application
/Private
the info, typically as a dictionary

Therefore, the info is in a sub-sub-dictionary.

/emacs
  /LastModified
    (D:20180917060000Z00'00')
  /Private
    /whatever
      (Some text.)
    /something
      (More info.)

# /Type /XObject /Subtype /Image

An image object is a dictionary with information about the image followed by the stream data for the image. The stream of data for a JPEG image is simply the exact same contents as the JPEG file, and its color space might be embedded (see B.4 on p.85 of the ICC1v43_2010-12.pdf).

In addition to the keys for a stream dictionary:

/Width
(REQUIRED) an integer for the width in samples (pixels)
/Height
(REQUIRED) an integer for the height in samples (pixels)
/ColorSpace
any type of color space except /Pattern; REQUIRED, but disallowed when /ImageMask is true
/BitsPerComponent
1, 2, 4, or 8. REQUIRED, but optional for image masks (and then it's 1). Always 8 when stream /Filter is /DCTDecode (JPEG).
/Intent
(1.1) /AbsoluteColorimetric, /RelativeColorimetric, /Saturation, or /Perceptual
/Interpolate
boolean, default false; approves attempts for smoothing small images (few pixels) across a large space (many pixels)

An image is in "image space", which means it is one unit wide by one unit high. Therefore, a transformation matrix for scaling (using width 0 0 height 0 0 with cm in a content stream) is necessary in order for it to be large enough to be visible. Making the width and height of the painted image (from Do in a content stream) the same as the /MediaBox for a /Page will paint the page completely with the image.
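
As a sketch of the scaling described above, the following Python function writes a content stream that scales the unit square of image space and paints the image. The function names and the resource name Im0 are hypothetical; the name must match a key in the /XObject subdictionary of the page's /Resources.

```python
def number(value):
    """Write a number without exponent notation or trailing zeros."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def paint_image(name, width, height, x=0, y=0):
    """Content stream: save state, scale and move image space, paint, restore."""
    matrix = " ".join(number(v) for v in (width, 0, 0, height, x, y))
    return f"q {matrix} cm /{name} Do Q".encode("ascii")


# Fill a page whose /MediaBox is [0 0 612 792] with the image named /Im0.
print(paint_image("Im0", 612, 792).decode("ascii"))
```

Wrapping the operators in q and Q keeps the scaling from affecting anything painted afterwards.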

# /ColorSpace

A color space is defined by an array of the color space family name followed by its parameters, if any. Some color space families are without parameters and therefore those can be specified as simply their names instead of an array.

/DeviceGray
/DeviceRGB
/DeviceCMYK
Color space families without parameters.
[/ICCBased stream]
(1.3) a cross-platform color profile defined by the ICC as a stream; additional keys for the stream dictionary:
/N
(REQUIRED) 1, 3, or 4; number of color components, f.e. RGB has 3
/Alternate
a name or array of names for any color space other than /Pattern; when omitted it's derived from /N, respectively: /DeviceGray, /DeviceRGB, or /DeviceCMYK

# summary

A basic foundation of objects for a PDF document. Notably, they are indirect dictionary objects and all are ultimately discovered from the trailer dictionary (a direct object).

trailer dictionary
  /Info
    document info dictionary (indirect reference)
  /Root
    /Type/Catalog dictionary (indirect reference)
    /Pages
      /Type/Pages dictionary (indirect reference)
      /Kids
        array of indirect references to /Page objects
    /Outlines
      /Type/Outlines dictionary (indirect reference)

Essentially, all that remains is to create some /Page objects for the /Kids array of the main /Pages dictionary, and optionally some outline item dictionaries for the /Outlines dictionary. Technically, a "page" can be whatever size, therefore there could be just one /Page object with everything on it, infinitely scrollable like an HTML document.

Afterwards, create the cross reference table listing the byte offsets for all the indirect objects, set the /Size key of the trailer dictionary, and set the byte offset for startxref.
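
Putting the steps together, here is a minimal sketch that assembles a complete file in Python: header, objects, cross reference table, trailer, and ending. The function `build_pdf` is hypothetical; it assumes objects are labeled contiguously from 1 in list order, generation number 0 throughout, and line feeds as the EOL.

```python
def build_pdf(objects, root, info=None):
    """Assemble a PDF from object bodies; labels are 1, 2, 3, ... in list order."""
    out = bytearray(b"%PDF-1.4\n%\x80\x80\x80\x80\n")
    offsets = []
    for label, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object's label
        out += b"%d 0 obj\n" % label + body + b"\nendobj\n"
    startxref = len(out)  # byte offset of the keyword xref
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # the non-existent 0 object
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    trailer = b"<< /Size %d /Root %d 0 R" % (len(objects) + 1, root)
    if info is not None:
        trailer += b" /Info %d 0 R" % info
    out += b"trailer\n" + trailer + b" >>\n"
    out += b"startxref\n%d\n%%%%EOF\n" % startxref
    return bytes(out)


objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 612 792] >>",
    b"<< /Type /Page /Parent 2 0 R /Resources << >> >>",
]
pdf = build_pdf(objects, root=1)
print(len(pdf))
```

The result is a one-page document with an empty page, since the /Page has no /Contents. Writing `pdf` to a file opened in binary mode should give something a reader can open.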

# keyboard macros for emacs

Keyboard macros used in emacs (aka Editing MACroS).

update byte offset for an object

Keyboard macro for updating the byte offsets in the cross reference table.

Requires register "8" to have the point at the first line of the table in the cross reference table, which is object 0. Also note that ESC is normally represented as the META modifier key, abbreviated to M-, in macros.

Start with the cursor at the point after the keyword endobj, which is conventionally the end of a line. This is also one character before the label of the next object on the next line. The value of the command `point' will be the byte offset from the beginning of the file for the next object because it is the position of the character at that point. (Notice the `point' command at the beginning of the file returns the value "1", not "0"; therefore it's the position, and the cursor must be positioned one character before.)

;Store the point value of the position before the object
;into a register.
M-:			;; eval-expression
M-(			;; ()
set-register		;; (set-register)
SPC			;; (set-register )
?h			;; (set-register ?h)
M-(			;; (set-register ?h())
point			;; (set-register ?h(point))
RET

;Get the label for the object, which is a number, from the
;next line.
;This will be used to skip forward that number of lines
;from the beginning of the cross reference table.
C-f			;; forward-char
C-x r n			;; number-to-register
n

;Move to the next object, which would be at the end of this
;one if there is another.
C-s			;; isearch-forward
endobj
RET

;Store point to a register for returning the cursor to this
;position when done for beginning the macro again.
C-x r SPC		;; point-to-register
9

;Jump to the point of the beginning of the first line of the
;table in the cross reference table, which is object number 0.
;Remember: this has to be set before using this macro.
C-x r j			;; jump-to-register
8

;The label of the object is also the number of its line in
;the cross reference table.
M-:			;; eval-expression
M-(			;; ()
forward-line		;; (forward-line)
SPC			;; (forward-line )
C-x r i			;; insert-register
n			;; This will be the label stored earlier.
RET

;Select the byte offset and delete.
M-F			;; forward-word, and select
DEL			;; delete-backward-char

;Insert the new 10-digit byte offset prefixed with zeros.
M-:			;; eval-expression
M-(			;; ()
insert			;; (insert)
M-(			;; (insert())
format"%010d"		;; (insert(format"%010d"))
C-x r i			;; insert-register
h			;; The value of point before the object.
RET

;Jump back to the end of this object in preparation for
;using this macro again, on the next object.
C-x r j			;; jump-to-register
9
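
For comparison, the effect of the macro above can be approximated outside emacs. This Python sketch finds every line that begins an indirect object and rewrites the matching 10-digit offsets in place. The function names are hypothetical, and the sketch assumes a single cross reference table starting at object 0, line feeds as the EOL, entries of exactly 20 bytes, and no stream data that happens to contain a line looking like the start of an object.

```python
import re


def object_offsets(data):
    """Map each object label to the byte offset where 'N G obj' begins."""
    offsets = {}
    for match in re.finditer(rb"(?m)^(\d+) (\d+) obj\b", data):
        offsets[int(match.group(1))] = match.start()
    return offsets


def rewrite_xref_offsets(data):
    """Return a copy of the file with corrected offsets in the xref table."""
    table = re.search(rb"(?m)^xref\n0 (\d+)\n", data)
    if table is None:
        raise ValueError("no cross reference table starting at object 0")
    count = int(table.group(1))
    first_entry = table.end()  # start of the entry for object 0
    out = bytearray(data)
    for label, offset in object_offsets(data).items():
        if label < count:
            entry = first_entry + 20 * label  # the label is also the line number
            out[entry:entry + 10] = b"%010d" % offset
    return bytes(out)
```

Because each offset is replaced by a field of the same width, the file length and the position of every object stay unchanged, so a single pass suffices. The startxref value is not touched by this sketch.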
