PDF has many different capabilities that can be confusing. This article explains some of them.
***
Mr. Rick Borstein of Adobe and the Acrobat for Legal Professionals blog has given us permission to reprint his 2005 article, Understanding "Flavors" of PDF below. We thank him for his support.
Understanding "Flavors" of PDF
Most people know that Acrobat files can contain a variety of types of information: text, images, and OCR’d information.
Each of these is a “flavor” of PDF with different capabilities and issues. PDF flavors are behind some questions I often hear, such as:
· Why isn’t this PDF searchable?
· Why is this PDF 50K and this one is 10K?
· Why does this PDF print slowly?
· Why does this PDF look funny on screen?
· Why can’t I select text from this PDF?
Not all PDFs are created equal. Some PDFs are more usable or offer benefits that other types do not.
I’ll examine the different flavors below and make some recommendations.
Why does this matter?
If you choose the wrong
flavor of PDF or compression, you may run into the following problems:
· Wasteful storage on local computers and networks by using the wrong type of compression
· Longer print times
· Excessively large PDFs that you intend to eFile, making it difficult to meet court rules
I’ve met with a great
many law firms and have seen some pretty wacky methods of creating PDF. It is
not uncommon to see someone print out a Word document and then scan it back in
to create a PDF! Ack!
Flavors of PDF
The table below
discusses the four basic flavors of PDF:
| | PDF Normal | PDF Image Only | PDF Image+Text | Combination |
|---|---|---|---|---|
| What is it? | Often called an “electronic PDF”, this type of PDF has never hit paper and was converted directly from an electronic source. | An image in a PDF wrapper. Could be an image of a page of text or a JPEG, etc. inside a PDF. | An image inside a PDF with an invisible layer of searchable text. | Any of the types at left. |
| Where does it come from? | Produced directly from a software application by “printing” to PDF or using the 1-button PDF creators supplied by Acrobat. | Scanners, Digital Copy Machines, TIFFs converted to PDF. | An image-only file that has been OCR’d using Acrobat Standard or Professional. | Create from Multiple Files in Acrobat allows you to combine any kinds of PDFs together. |
| Is it searchable? | Yes. 100% accurate since no OCR has taken place. | No. Does not contain any searchable text. | Yes. OCR is not a perfect process. Do not expect 100% accuracy. | Depends. If the combined PDFs are searchable, yes. |
| Notes | Prints fastest. Prints at best quality. Smallest file size. | Recommend no more than 300dpi for scanning. A good format to use in discovery when you don’t want to give the other side an advantage. | Best way to make paper documents searchable. | Can contain multiple document sizes. |
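The flavor definitions in the table can be restated as a small decision rule. The following is only a minimal sketch: the names `Flavor`, `classify_page` and `describe_document` are hypothetical, and the two boolean inputs (whether a page is a scanned image, and whether it carries a text layer) are assumed to be known already. Detecting them in a real file would require a PDF library, which is not shown here.

```python
from enum import Enum
from typing import List, Tuple


class Flavor(Enum):
    NORMAL = "PDF Normal"
    IMAGE_ONLY = "PDF Image Only"
    IMAGE_TEXT = "PDF Image+Text"


def classify_page(is_scanned_image: bool, has_text_layer: bool) -> Flavor:
    """Map two page properties (assumed known) onto a flavor from the table."""
    if not is_scanned_image:
        # Converted directly from an electronic source.
        return Flavor.NORMAL
    # A scanned page is searchable only if OCR added a hidden text layer.
    return Flavor.IMAGE_TEXT if has_text_layer else Flavor.IMAGE_ONLY


def describe_document(pages: List[Flavor]) -> Tuple[str, bool]:
    """Return (label, fully_searchable) for a document made of classified pages."""
    if not pages:
        raise ValueError("document has no pages")
    distinct = set(pages)
    # Simplification: a mix of flavors is labeled "Combination".
    label = pages[0].value if len(distinct) == 1 else "Combination"
    # Image-only pages contain no searchable text.
    fully_searchable = Flavor.IMAGE_ONLY not in distinct
    return label, fully_searchable


if __name__ == "__main__":
    doc = [classify_page(False, True), classify_page(True, False)]
    print(describe_document(doc))
```

The sketch treats "searchable" as all-or-nothing, which matches the "Depends" entry in the Combination column: a combined file is fully searchable only if every piece that went into it is.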
PDF Settings Affecting File Size
PDF Normal offers the
best performance, smallest file size and best searchability. These fully
electronic files contain all the fonts needed for printing. If you have an
option to create PDF Normal, always use it!
When creating PDFs from
paper, carefully choose your compression and scanning resolution.
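To see why scanning resolution matters before any compression is applied, it helps to work out the raw pixel count of a page. The sketch below is illustrative only: the function name `raw_bitmap_bytes` is made up for this example, and it assumes a 1-bit black & white scan of an 8.5″ by 11″ page, ignoring file headers and row padding.

```python
def raw_bitmap_bytes(width_in: float, height_in: float, dpi: int,
                     bits_per_pixel: int = 1) -> int:
    """Uncompressed size in bytes of a scanned page (no headers, no row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return (total_bits + 7) // 8


if __name__ == "__main__":
    for dpi in (200, 300):
        size = raw_bitmap_bytes(8.5, 11, dpi)
        print(f"{dpi} dpi: {size:,} bytes uncompressed")
```

Going from 200 dpi to 300 dpi multiplies the pixel count by (300/200)² = 2.25, so the compressor has more than twice as much data to work with before it starts.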
There are three common
black & white compression algorithms used for scanned images:
| File Size | Compression |
|---|---|
| Larger | CCITT Group 4 |
| | JBIG2 Lossless |
| Smaller | JBIG2 Lossy |
If you choose Create PDF
from Scanner in Acrobat, the default compression is JBIG2 Lossless. This offers
a great balance between file size and quality.
Other hardware and software products that scan to PDF generally use CCITT Group 4 compression, which produces considerably larger files.
CCITT Group 4
compression was developed as a fax compression technology. The rudimentary
processors of fax machines in the early 1980s had just enough power to
decompress CCITT Group 4 files. Surprisingly, it is still widely used, but is
an inefficient compression scheme.
While rarely relevant in the legal market, Acrobat is intelligent enough to compress files selectively using Adaptive compression. A color brochure may have black text, a color image and line art, each of which can have a different compression scheme. If you need to scan color brochures and the like (perhaps in an Intellectual Property dispute), choose the Searchable Image-Compact option.
I’ve conducted several
visual tests on JBIG2 Lossless versus Lossy. It is difficult to detect the
differences between these two compression schemes on good quality scanned
documents. If you have good originals, go ahead and use the Lossy JBIG2.
File Size Comparison
The table below compares
the file sizes of a typical 8.5″ by 11″ legal document for various flavors of
PDF:
Single Page Legal Document – 200 DPI
| | PDF Normal | PDF Image Only, 200 dpi | PDF Image Only, 200 dpi | PDF Image Only, 200 dpi | PDF+Text, 200 dpi |
|---|---|---|---|---|---|
| File size | 9.71K | 40.79K | 20.91K | 9.4K | 26.64K |
| Compression and Notes | Fonts Embedded, no tags | CCITT G4 | JBIG2 Lossless | JBIG2 Lossy | JBIG2 Lossy Compression |
Single Page Legal Document – 300 DPI
| | PDF Normal | PDF Image Only, 300 dpi | PDF Image Only, 300 dpi | PDF Image Only, 300 dpi | PDF+Text, 300 dpi |
|---|---|---|---|---|---|
| File size | 9.71K | 53.77K | 31.02K | 10.7K | 34.34K |
| Compression and Notes | Fonts Embedded, no tags | CCITT G4 | JBIG2 Lossless | JBIG2 Lossy | JBIG2 Lossy Compression |
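The savings implied by these figures are easy to work out. The short sketch below copies the image-only file sizes from the tables above and computes how much smaller each JBIG2 variant is than its baseline; the helper name `percent_smaller` is just an illustrative choice.

```python
# Image-only file sizes in KB, copied from the comparison tables above.
SIZES_KB = {
    200: {"CCITT G4": 40.79, "JBIG2 Lossless": 20.91, "JBIG2 Lossy": 9.4},
    300: {"CCITT G4": 53.77, "JBIG2 Lossless": 31.02, "JBIG2 Lossy": 10.7},
}


def percent_smaller(baseline_kb: float, candidate_kb: float) -> float:
    """How much smaller the candidate is than the baseline, in percent."""
    return (1 - candidate_kb / baseline_kb) * 100


if __name__ == "__main__":
    for dpi, sizes in SIZES_KB.items():
        g4 = sizes["CCITT G4"]
        lossless = sizes["JBIG2 Lossless"]
        lossy = sizes["JBIG2 Lossy"]
        print(f"{dpi} dpi:")
        print(f"  Lossless vs CCITT G4: {percent_smaller(g4, lossless):.1f}% smaller")
        print(f"  Lossy vs CCITT G4:    {percent_smaller(g4, lossy):.1f}% smaller")
        print(f"  Lossy vs Lossless:    {percent_smaller(lossless, lossy):.1f}% smaller")
```

On this single test page, JBIG2 Lossless comes out a little under half the size of CCITT G4 at 200 dpi (about 49% smaller) and about 42% smaller at 300 dpi. JBIG2 Lossy is in turn about 55% smaller than Lossless at 200 dpi and about 65.5% smaller at 300 dpi. These ratios come from one document and may not carry over to other material.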
Testing Protocol
NOTE: I
did these tests back in the Acrobat 7 timeframe. Current versions of Acrobat
offer more robust compression (Adaptive Compression in Acrobat X) and generally
work better.
1. The PDF Normal file was created by choosing the Adobe PDF print driver. [Note 1]
2. The PDF Normal file was opened in Acrobat and saved as either 200 or 300 dpi uncompressed TIFFs.
3. PDF Optimizer was used to target three types of compression: CCITT G4, JBIG2 Lossless and JBIG2 Lossy.
4. All image and image+text PDFs were created using Acrobat 7 by choosing Recognize Text Using OCR.
Recommendations
Here are my tips for
making the best choices when working with PDF files:
1. Where did that PDF come from? You need to know . . .
Unless you scan it in yourself using the Create PDF from Scanner option in Acrobat, your PDF file could most likely be made a lot smaller using the PDF Optimizer in Acrobat Professional. Chances are the image-only and image+text PDFs you get from outside your firm use old, inefficient CCITT Group 4 compression.
2. Keep Electronic Documents Electronic
Always convert electronic documents directly to PDF using the 1-button PDF Creators installed by Acrobat into Office applications or using the Adobe PDF print driver. You’ll have a considerably smaller file if you do so, and searchability is much better.
3. Scan at 300dpi, OCR and then Downsample if Necessary
You’ll get more accurate OCR when scanning at 300 dpi. Always downsample and compress using the PDF Optimizer in Acrobat Professional after performing OCR. Acrobat Professional can also downsample files in batches.
4. Try JBIG2 Lossy Compression
Although the word “Lossy” is a bit scary, give this compression scheme a try. Documents still look good on-screen and file sizes can be 50% smaller.
Notes
1. Multiple-page PDF Normal files are considerably smaller than multi-page image-only PDFs. Single page PDF Normal files must contain all the fonts necessary to render the page. This information does not need to be duplicated for successive pages.
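Note 1 can be illustrated with a rough size model. This is only a sketch under stated assumptions: the function names are hypothetical, and the split of the 9.71K single-page file into 8.0K of fonts and 1.71K of page content is invented purely for illustration, since the article reports only the total. The 20.91K figure is the JBIG2 Lossless 200 dpi size from the comparison table, assumed here to hold for every scanned page.

```python
def normal_pdf_kb(pages: int, font_kb: float, content_kb_per_page: float) -> float:
    """Fonts are embedded once; only the page content grows with page count."""
    return font_kb + pages * content_kb_per_page


def image_pdf_kb(pages: int, image_kb_per_page: float) -> float:
    """Every scanned page carries its own full-page image."""
    return pages * image_kb_per_page


FONT_KB = 8.0      # hypothetical share of the 9.71K file taken by fonts
CONTENT_KB = 1.71  # hypothetical per-page content; 8.0 + 1.71 = 9.71
IMAGE_KB = 20.91   # JBIG2 Lossless, 200 dpi, from the comparison table

if __name__ == "__main__":
    for pages in (1, 10, 50):
        normal = normal_pdf_kb(pages, FONT_KB, CONTENT_KB)
        image = image_pdf_kb(pages, IMAGE_KB)
        print(f"{pages:>2} pages: PDF Normal ~{normal:.1f}K, image-only ~{image:.1f}K")
```

Whatever the real split is, the shape of the result is the same: the one-time font cost is spread over more pages, so the gap between PDF Normal and image-only files widens as the page count grows.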