tesseract  3.05.00
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 #include "config_auto.h"
21 #endif
22 
23 #include "allheaders.h"
24 #include "baseapi.h"
25 #include "math.h"
26 #include "renderer.h"
27 #include "strngs.h"
28 #include "tprintf.h"
29 
30 #ifdef _MSC_VER
31 #include "mathfix.h"
32 #endif
33 
34 /*
35 
36 Design notes from Ken Sharp, with light editing.
37 
38 We think one solution is a font with a single glyph (.notdef) and a
39 CIDToGIDMap which maps all the CIDs to 0. That map would then be
40 stored as a stream in the PDF file, and when flate compressed should
41 be pretty small. The font, of course, will be approximately the same
42 size as the one you currently use.
43 
44 I'm working on such a font now, the CIDToGIDMap is trivial, you just
45 create a stream object which contains 128k bytes (2 bytes per possible
46 CID and your CIDs range from 0 to 65535) and where you currently have
47 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
48 
49 Note that if, in future, you were to use a different (ie not 2 byte)
50 CMap for character codes you could trivially extend the CIDToGIDMap.
51 
52 The following is an explanation of how some of the font stuff works.
53 This may be too simple for you, in which case please accept my
54 apologies; it's hard to know how much knowledge someone has. You can
55 skip all this anyway, it's just for information.
56 
57 The font embedded in a PDF file is usually intended just to be
58 rendered, but extensions allow for at least some ability to locate (or
59 copy) text from a document. This isn't something which was an original
60 goal of the PDF format, but it's been retrofitted, presumably due to
61 popular demand.
62 
63 To do this reliably the PDF file must contain a ToUnicode CMap, a
64 device for mapping character codes to Unicode code points. If one of
65 these is present, then this will be used to convert the character
66 codes into Unicode values. If it's not present then the reader will
67 fall back through a series of heuristics to try and guess the
68 result. This is, as you would expect, prone to failure.
69 
70 This doesn't concern you of course, since you always write a ToUnicode
71 CMap. Because you are writing the text in text rendering mode 3, it
72 would seem that you don't really need to worry about the font, but in
73 the PDF spec you cannot have an isolated ToUnicode CMap; it has to be
74 attached to a font, so in order to get even copy/paste to work you
75 need to define a font.
76 
77 This is what leads to problems, tools like pdfwrite assume that they
78 are going to be able to (or even have to) modify the font entries, so
79 they require that the font being embedded be valid, and to be honest
80 the font Tesseract embeds isn't valid (for this purpose).
81 
82 
83 To see why, let's look at how text is specified in a PDF file:
84 
85 (Test) Tj
86 
87 Now that looks like text but actually it isn't. Each of those bytes is
88 a 'character code'. When it comes to rendering the text a complex
89 sequence of events takes place, which converts the character code into
90 'something' which the font understands. It's entirely possible via
91 character mappings to have that text render as 'Sftu'.
92 
93 For simple fonts (PostScript type 1), we use the character code as the
94 index into an Encoding array (256 elements), each element of which is
95 a glyph name, so this gives us a glyph name. We then consult the
96 CharStrings dictionary in the font, that's a complex object which
97 contains pairs of keys and values, you can use the key to retrieve a
98 given value. So we have a glyph name, we then use that as the key to
99 the dictionary and retrieve the associated value. For a type 1 font,
100 the value is a glyph program that describes how to draw the glyph.
101 
102 For CIDFonts, it's a little more complicated. Because CIDFonts can be
103 large, using a glyph name as the key is unreasonable (it would also
104 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
105 as the key. CIDs are just numbers.
106 
107 But.... We don't use the character code as the CID. What we do is use
108 a CMap to convert the character code into a CID. We then use the CID
109 to key the CharStrings dictionary and proceed as before. So the 'CMap'
110 is the equivalent of the Encoding array, but it's a more compact and
111 flexible representation.
112 
113 Note that you have to use the CMap just to find out how many bytes
114 constitute a character code, and it can be variable. For example you
115 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
116 0x80->0xef then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
117 have seen CMaps defining character codes up to 5 bytes wide.
118 
119 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
120 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
121 a Glyph ID (GID) (and the LOCA table) which may well not be anything
122 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
123 the CIDs to GIDs, and we can then use the GID to get the glyph
124 description from the GLYF table of the font.
125 
126 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
127 
128 Looking at the PDF file I was supplied with we see that it contains
129 text like :
130 
131 <0x0075> Tj
132 
133 So we start by taking the character code (117) and look it up in the
134 CMap. Well you don't supply a CMap, you just use the Identity-H one
135 which is predefined. So character code 117 maps to CID 117. Then we
136 use the CIDToGIDMap, again you don't supply one, you just use the
137 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
138 were supplied with only contains 116 glyphs.
139 
140 Now for Latin that's not a huge problem, you can just supply a bigger
141 font. But for more complex languages that *is* going to be more of a
142 problem. Either you need to supply a font which contains glyphs for
143 all the possible CID->GID mappings, or we need to think laterally.
144 
145 Our solution using a TrueType CIDFont is to intervene at the
146 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
147 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
148 looking into now.
149 
150 It would also be possible to have a 'PostScript' (ie type 1 outlines)
151 CIDFont which contained 1 glyph, and a CMap which mapped all character
152 codes to CID 0. The effect would be the same.
153 
154 It's possible (I haven't checked) that the PostScript CIDFont and
155 associated CMap would be smaller than the TrueType font and associated
156 CIDToGIDMap.
157 
158 --- in a followup ---
159 
160 OK there is a small problem there, if I use GID 0 then Acrobat gets
161 upset about it and complains it cannot extract the font. If I set the
162 CIDToGIDMap so that all the entries are 1 instead, its happy. Totally
163 mad......
164 
165 */
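// The lookup chain described in the notes above (character code -> CID
// -> GID) can be sketched roughly as follows. This is an illustrative
// sketch only: BuildConstantCIDToGIDMap, CodeToCID and CIDToGID are
// hypothetical names that are not part of Tesseract, and the map is
// kept uncompressed for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the body of a CIDToGIDMap stream: two big-endian bytes per CID,
// covering CIDs 0..65535, with every CID mapped to the same GID.
std::vector<uint8_t> BuildConstantCIDToGIDMap(uint16_t gid) {
  std::vector<uint8_t> map;
  map.reserve(2 * 65536);
  for (int cid = 0; cid < 65536; ++cid) {
    map.push_back(static_cast<uint8_t>(gid >> 8));    // high byte
    map.push_back(static_cast<uint8_t>(gid & 0xFF));  // low byte
  }
  return map;
}

// With the predefined Identity-H CMap, a 2-byte character code is the CID.
uint16_t CodeToCID(uint16_t code) { return code; }

// Look up the GID for a CID in the uncompressed map.
uint16_t CIDToGID(const std::vector<uint8_t>& map, uint16_t cid) {
  assert(map.size() >= 2 * (static_cast<size_t>(cid) + 1));
  return static_cast<uint16_t>((map[2 * cid] << 8) | map[2 * cid + 1]);
}
```

// With gid = 1 (the value the followup note settles on), every
// character code ends up at the same single glyph.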
166 
167 namespace tesseract {
168 
169 // Use for PDF object fragments. Must be large enough
170 // to hold a colormap with 256 colors in the verbose
171 // PDF representation.
172 const int kBasicBufSize = 2048;
173 
174 // If the font is 10 pts, nominal character width is 5 pts
175 const int kCharWidth = 2;
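// A rough sketch of how this constant relates to the horizontal
// stretch (Tz) computed later in this file: each glyph is assumed to
// advance fontsize / kCharWidth points, so a word is scaled by the
// ratio of its measured length to that natural width. The names below
// (kCharWidthSketch, NaturalWidth, HorizontalStretchPercent) are
// hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Assumed to mirror kCharWidth: each glyph advances fontsize / 2 points.
const int kCharWidthSketch = 2;

// Natural width in points of `count` glyphs at `fontsize`, before scaling.
double NaturalWidth(int fontsize, int count) {
  return static_cast<double>(fontsize) * count / kCharWidthSketch;
}

// Percentage for the Tz operator so that the glyphs span `target` points.
double HorizontalStretchPercent(double target, int fontsize, int count) {
  assert(fontsize > 0 && count > 0);
  return 100.0 * target / NaturalWidth(fontsize, count);
}
```

// For example, four glyphs at 10 pt are 20 pt wide by default, so a
// 30 pt word needs a stretch of 150 percent.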
176 
177 /**********************************************************************
178  * PDF Renderer interface implementation
179  **********************************************************************/
180 
181 TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
182  : TessResultRenderer(outputbase, "pdf") {
183  obj_ = 0;
184  datadir_ = datadir;
185  offsets_.push_back(0);
186 }
187 
188 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
189  offsets_.push_back(objectsize + offsets_.back());
190  obj_++;
191 }
192 
193 void TessPDFRenderer::AppendPDFObject(const char *data) {
194  AppendPDFObjectDIY(strlen(data));
195  AppendString((const char *)data);
196 }
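// The offset bookkeeping above can be pictured with a small stand-in
// class. This is a sketch under assumptions, not the renderer itself:
// OffsetTracker is a hypothetical name, and it collects output in a
// std::string instead of writing to a file. Because the PDF header is
// appended first, slot n ends up holding the start of object n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class OffsetTracker {
 public:
  OffsetTracker() { offsets_.push_back(0); }

  // Write a chunk (header or complete object) and record where it ends,
  // which is where the next chunk will start.
  void Append(const std::string& chunk) {
    out_ += chunk;
    offsets_.push_back(offsets_.back() + chunk.size());
  }

  // Byte offset at which object number n (n >= 1) starts.
  size_t StartOf(size_t n) const {
    assert(n < offsets_.size());
    return offsets_[n];
  }

  const std::string& output() const { return out_; }

 private:
  std::string out_;
  std::vector<size_t> offsets_;
};
```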
197 
198 // Helper function to prevent us from accidentally writing
199 // scientific notation to an HOCR or PDF file. Besides, three
200 // decimal places are all you really need.
201 double prec(double x) {
202  double kPrecision = 1000.0;
203  double a = round(x * kPrecision) / kPrecision;
204  if (a == -0)
205  return 0;
206  return a;
207 }
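// A small sketch of why the rounding matters. printf-style "%g"
// formatting switches to exponent notation for magnitudes below 1e-4,
// so rounding to three places first keeps such values as a plain "0".
// RoundToMillis and FormatNumber are hypothetical names, and the
// "%.8g" format is an assumption about how the numbers are printed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Round to three decimal places and normalise -0.0 to +0.0.
double RoundToMillis(double x) {
  const double kPrecision = 1000.0;
  double a = std::round(x * kPrecision) / kPrecision;
  return a == 0.0 ? 0.0 : a;
}

// Format a rounded value with an assumed "%.8g" conversion.
std::string FormatNumber(double x) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.8g", RoundToMillis(x));
  return buf;
}
```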
208 
209 long dist2(int x1, int y1, int x2, int y2) {
210  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
211 }
212 
213 // Viewers like evince can get really confused during copy-paste when
214 // the baseline wanders around. So I've decided to project every word
215 // onto the (straight) line baseline. All numbers are in the native
216 // PDF coordinate system, which has the origin in the bottom left and
217 // the unit is points, which is 1/72 inch. Tesseract reports baselines
218 // left-to-right no matter what the reading order is. We need the
219 // word baseline in reading order, so we do that conversion here. Returns
220 // the word's baseline origin and length.
221 void GetWordBaseline(int writing_direction, int ppi, int height,
222  int word_x1, int word_y1, int word_x2, int word_y2,
223  int line_x1, int line_y1, int line_x2, int line_y2,
224  double *x0, double *y0, double *length) {
225  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
226  Swap(&word_x1, &word_x2);
227  Swap(&word_y1, &word_y2);
228  }
229  double word_length;
230  double x, y;
231  {
232  int px = word_x1;
233  int py = word_y1;
234  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
235  if (l2 == 0) {
236  x = line_x1;
237  y = line_y1;
238  } else {
239  double t = ((px - line_x2) * (line_x2 - line_x1) +
240  (py - line_y2) * (line_y2 - line_y1)) / l2;
241  x = line_x2 + t * (line_x2 - line_x1);
242  y = line_y2 + t * (line_y2 - line_y1);
243  }
244  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
245  word_x2, word_y2)));
246  word_length = word_length * 72.0 / ppi;
247  x = x * 72 / ppi;
248  y = height - (y * 72.0 / ppi);
249  }
250  *x0 = x;
251  *y0 = y;
252  *length = word_length;
253 }
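// The projection and unit conversion used above, isolated as a sketch.
// Point, ProjectOntoLine and PixelsToPdfPoints are hypothetical names
// for this illustration. The projection here is parameterised from the
// first line point rather than the second, which yields the same
// projected point.

```cpp
#include <cassert>
#include <cmath>

struct Point {
  double x;
  double y;
};

// Orthogonal projection of p onto the infinite line through a and b.
// Falls back to a when the two line points coincide.
Point ProjectOntoLine(Point p, Point a, Point b) {
  double dx = b.x - a.x;
  double dy = b.y - a.y;
  double len2 = dx * dx + dy * dy;
  if (len2 == 0) return a;
  double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / len2;
  return Point{a.x + t * dx, a.y + t * dy};
}

// Convert image pixels (origin top left) to PDF points (origin bottom left).
Point PixelsToPdfPoints(Point px, double ppi, double page_height_pts) {
  return Point{px.x * 72.0 / ppi, page_height_pts - px.y * 72.0 / ppi};
}
```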
254 
255 // Compute coefficients for an affine matrix describing the rotation
256 // of the text. If the text is right-to-left such as Arabic or Hebrew,
257 // we reflect over the Y-axis. This matrix will set the coordinate
258 // system for placing text in the PDF file.
259 //
260 // RTL
261 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
262 // [ y' ] [ c d ][ y ] [ 0 1 ] [-sin cos ][ y ]
263 void AffineMatrix(int writing_direction,
264  int line_x1, int line_y1, int line_x2, int line_y2,
265  double *a, double *b, double *c, double *d) {
266  double theta = atan2(static_cast<double>(line_y1 - line_y2),
267  static_cast<double>(line_x2 - line_x1));
268  *a = cos(theta);
269  *b = sin(theta);
270  *c = -sin(theta);
271  *d = cos(theta);
272  switch(writing_direction) {
273  case WRITING_DIRECTION_RIGHT_TO_LEFT:
274  *a = -*a;
275  *b = -*b;
276  break;
277  case WRITING_DIRECTION_TOP_TO_BOTTOM:
278  // TODO(jbreiden) Consider using the vertical PDF writing mode.
279  break;
280  default:
281  break;
282  }
283 }
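// A self-contained sketch of the matrix computation, for trying out
// values by hand. TextMatrix and BaselineMatrix are hypothetical names,
// and a plain bool stands in for the writing-direction enum.

```cpp
#include <cassert>
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

TextMatrix BaselineMatrix(bool right_to_left,
                          int x1, int y1, int x2, int y2) {
  // Image y grows downward, so y1 - y2 is the rise in PDF orientation.
  double theta = std::atan2(static_cast<double>(y1 - y2),
                            static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta),
                  -std::sin(theta), std::cos(theta)};
  if (right_to_left) {  // reflect over the Y-axis
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

// A flat left-to-right baseline gives the identity matrix; a baseline
// rising at 45 degrees gives a = b = d = sqrt(1/2) and c = -sqrt(1/2).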
284 
285 // There are some really awkward PDF viewers in the wild, such as
286 // 'Preview' which ships with the Mac. They do a better job with text
287 // selection and highlighting when given perfectly flat baseline
288 // instead of very slightly tilted. We clip small tilts to appease
289 // these viewers. I chose this threshold large enough to absorb noise,
290 // but small enough that lines probably won't cross each other if the
291 // whole page is tilted at almost exactly the clipping threshold.
292 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
293  int *line_x1, int *line_y1,
294  int *line_x2, int *line_y2) {
295  *line_x1 = x1;
296  *line_y1 = y1;
297  *line_x2 = x2;
298  *line_y2 = y2;
299  double rise = abs(y2 - y1) * 72.0 / ppi;
300  double run = abs(x2 - x1) * 72.0 / ppi;
301  if (rise < 2.0 && 2.0 < run)
302  *line_y1 = *line_y2 = (y1 + y2) / 2;
303 }
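// The clipping rule in isolation, as a sketch with hypothetical names
// (Baseline, FlattenSmallTilt). Rise and run are converted to points
// using floating-point arithmetic; a baseline is flattened only when
// it rises less than 2 pt over a run of more than 2 pt.

```cpp
#include <cassert>
#include <cstdlib>

struct Baseline {
  int x1, y1, x2, y2;
};

Baseline FlattenSmallTilt(int ppi, Baseline in) {
  assert(ppi > 0);
  double rise = std::abs(in.y2 - in.y1) * 72.0 / ppi;
  double run = std::abs(in.x2 - in.x1) * 72.0 / ppi;
  Baseline out = in;
  if (rise < 2.0 && 2.0 < run) {
    out.y1 = out.y2 = (in.y1 + in.y2) / 2;  // use the midpoint height
  }
  return out;
}
```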
304 
305 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
306  double width, double height) {
307  STRING pdf_str("");
308  double ppi = api->GetSourceYResolution();
309 
310  // These initial conditions are all arbitrary and will be overwritten
311  double old_x = 0.0, old_y = 0.0;
312  int old_fontsize = 0;
313  tesseract::WritingDirection old_writing_direction =
314  WRITING_DIRECTION_LEFT_TO_RIGHT;
315  bool new_block = true;
316  int fontsize = 0;
317  double a = 1;
318  double b = 0;
319  double c = 0;
320  double d = 1;
321 
322  // TODO(jbreiden) This marries the text and image together.
323  // Slightly cleaner from an abstraction standpoint if this were to
324  // live inside a separate text object.
325  pdf_str += "q ";
326  pdf_str.add_str_double("", prec(width));
327  pdf_str += " 0 0 ";
328  pdf_str.add_str_double("", prec(height));
329  pdf_str += " 0 0 cm /Im1 Do Q\n";
330 
331  int line_x1 = 0;
332  int line_y1 = 0;
333  int line_x2 = 0;
334  int line_y2 = 0;
335 
336  ResultIterator *res_it = api->GetIterator();
337  while (!res_it->Empty(RIL_BLOCK)) {
338  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
339  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
340  old_fontsize = 0; // Every block will declare its fontsize
341  new_block = true; // Every block will declare its affine matrix
342  }
343 
344  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
345  int x1, y1, x2, y2;
346  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
347  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
348  }
349 
350  if (res_it->Empty(RIL_WORD)) {
351  res_it->Next(RIL_WORD);
352  continue;
353  }
354 
355  // Writing direction changes at a per-word granularity
356  tesseract::WritingDirection writing_direction;
357  {
358  tesseract::Orientation orientation;
359  tesseract::TextlineOrder textline_order;
360  float deskew_angle;
361  res_it->Orientation(&orientation, &writing_direction,
362  &textline_order, &deskew_angle);
363  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
364  switch (res_it->WordDirection()) {
365  case DIR_LEFT_TO_RIGHT:
366  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
367  break;
368  case DIR_RIGHT_TO_LEFT:
369  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
370  break;
371  default:
372  writing_direction = old_writing_direction;
373  }
374  }
375  }
376 
377  // Where is word origin and how long is it?
378  double x, y, word_length;
379  {
380  int word_x1, word_y1, word_x2, word_y2;
381  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
382  GetWordBaseline(writing_direction, ppi, height,
383  word_x1, word_y1, word_x2, word_y2,
384  line_x1, line_y1, line_x2, line_y2,
385  &x, &y, &word_length);
386  }
387 
388  if (writing_direction != old_writing_direction || new_block) {
389  AffineMatrix(writing_direction,
390  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
391  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
392  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
393  pdf_str.add_str_double(" ", prec(c)); // . system for all
394  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
395  pdf_str.add_str_double(" ", prec(x)); // .
396  pdf_str.add_str_double(" ", prec(y)); // .
397  pdf_str += (" Tm "); // Place cursor absolutely
398  new_block = false;
399  } else {
400  double dx = x - old_x;
401  double dy = y - old_y;
402  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
403  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
404  pdf_str += (" Td "); // Relative moveto
405  }
406  old_x = x;
407  old_y = y;
408  old_writing_direction = writing_direction;
409 
410  // Adjust font size on a per word granularity. Pay attention to
411  // fontsize, old_fontsize, and pdf_str. We've found that for
412  // in Arabic, Tesseract will happily return a fontsize of zero,
413  // so we make up a default number to protect ourselves.
414  {
415  bool bold, italic, underlined, monospace, serif, smallcaps;
416  int font_id;
417  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
418  &serif, &smallcaps, &fontsize, &font_id);
419  const int kDefaultFontsize = 8;
420  if (fontsize <= 0)
421  fontsize = kDefaultFontsize;
422  if (fontsize != old_fontsize) {
423  char textfont[20];
424  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
425  pdf_str += textfont;
426  old_fontsize = fontsize;
427  }
428  }
429 
430  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
431  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
432  STRING pdf_word("");
433  int pdf_word_len = 0;
434  do {
435  const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
436  if (grapheme && grapheme[0] != '\0') {
437  GenericVector<int> unicodes;
438  UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
439  char utf16[20];
440  for (int i = 0; i < unicodes.length(); i++) {
441  int code = unicodes[i];
442  // Convert to UTF-16BE https://en.wikipedia.org/wiki/UTF-16
443  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
444  tprintf("Dropping invalid codepoint %d\n", code);
445  continue;
446  }
447  if (code < 0x10000) {
448  snprintf(utf16, sizeof(utf16), "<%04X>", code);
449  } else {
450  int a = code - 0x010000;
451  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
452  int low_surrogate = (0x03FF & a) + 0xDC00;
453  snprintf(utf16, sizeof(utf16), "<%04X%04X>",
454  high_surrogate, low_surrogate);
455  }
456  pdf_word += utf16;
457  pdf_word_len++;
458  }
459  }
460  delete []grapheme;
461  res_it->Next(RIL_SYMBOL);
462  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
463  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
464  double h_stretch =
465  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
466  pdf_str.add_str_double("", h_stretch);
467  pdf_str += " Tz"; // horizontal stretch
468  pdf_str += " [ ";
469  pdf_str += pdf_word; // UTF-16BE representation
470  pdf_str += " ] TJ"; // show the text
471  }
472  if (last_word_in_line) {
473  pdf_str += " \n";
474  }
475  if (last_word_in_block) {
476  pdf_str += "ET\n"; // end the text object
477  }
478  }
479  char *ret = new char[pdf_str.length() + 1];
480  strcpy(ret, pdf_str.string());
481  delete res_it;
482  return ret;
483 }
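// The UTF-16BE hex encoding used inside the loop above, pulled out as
// a standalone sketch. CodepointToPdfHex is a hypothetical helper name.
// It returns an empty string for code points that would be dropped
// (surrogates and values beyond U+10FFFF).

```cpp
#include <cassert>
#include <cstdio>
#include <string>

std::string CodepointToPdfHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return "";  // not a valid Unicode scalar value
  }
  char buf[16];
  if (code < 0x10000) {
    // Basic Multilingual Plane: one 16-bit unit.
    std::snprintf(buf, sizeof(buf), "<%04X>", static_cast<unsigned>(code));
  } else {
    // Supplementary plane: split 20 bits into a surrogate pair.
    int v = code - 0x10000;
    unsigned high = 0xD800 + ((v >> 10) & 0x3FF);
    unsigned low = 0xDC00 + (v & 0x3FF);
    std::snprintf(buf, sizeof(buf), "<%04X%04X>", high, low);
  }
  return buf;
}
```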
484 
485 bool TessPDFRenderer::BeginDocumentHandler() {
486  char buf[kBasicBufSize];
487  size_t n;
488 
489  n = snprintf(buf, sizeof(buf),
490  "%%PDF-1.5\n"
491  "%%%c%c%c%c\n",
492  0xDE, 0xAD, 0xBE, 0xEB);
493  if (n >= sizeof(buf)) return false;
494  AppendPDFObject(buf);
495 
496  // CATALOG
497  n = snprintf(buf, sizeof(buf),
498  "1 0 obj\n"
499  "<<\n"
500  " /Type /Catalog\n"
501  " /Pages %ld 0 R\n"
502  ">>\n"
503  "endobj\n",
504  2L);
505  if (n >= sizeof(buf)) return false;
506  AppendPDFObject(buf);
507 
508  // We are reserving object #2 for the /Pages
509  // object, which I am going to create and write
510  // at the end of the PDF file.
511  AppendPDFObject("");
512 
513  // TYPE0 FONT
514  n = snprintf(buf, sizeof(buf),
515  "3 0 obj\n"
516  "<<\n"
517  " /BaseFont /GlyphLessFont\n"
518  " /DescendantFonts [ %ld 0 R ]\n"
519  " /Encoding /Identity-H\n"
520  " /Subtype /Type0\n"
521  " /ToUnicode %ld 0 R\n"
522  " /Type /Font\n"
523  ">>\n"
524  "endobj\n",
525  4L, // CIDFontType2 font
526  6L // ToUnicode
527  );
528  if (n >= sizeof(buf)) return false;
529  AppendPDFObject(buf);
530 
531  // CIDFONTTYPE2
532  n = snprintf(buf, sizeof(buf),
533  "4 0 obj\n"
534  "<<\n"
535  " /BaseFont /GlyphLessFont\n"
536  " /CIDToGIDMap %ld 0 R\n"
537  " /CIDSystemInfo\n"
538  " <<\n"
539  " /Ordering (Identity)\n"
540  " /Registry (Adobe)\n"
541  " /Supplement 0\n"
542  " >>\n"
543  " /FontDescriptor %ld 0 R\n"
544  " /Subtype /CIDFontType2\n"
545  " /Type /Font\n"
546  " /DW %d\n"
547  ">>\n"
548  "endobj\n",
549  5L, // CIDToGIDMap
550  7L, // Font descriptor
551  1000 / kCharWidth);
552  if (n >= sizeof(buf)) return false;
553  AppendPDFObject(buf);
554 
555  // CIDTOGIDMAP
556  const int kCIDToGIDMapSize = 2 * (1 << 16);
557  unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
558  for (int i = 0; i < kCIDToGIDMapSize; i++) {
559  cidtogidmap[i] = (i % 2) ? 1 : 0;
560  }
561  size_t len;
562  unsigned char *comp =
563  zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
564  delete[] cidtogidmap;
565  n = snprintf(buf, sizeof(buf),
566  "5 0 obj\n"
567  "<<\n"
568  " /Length %lu /Filter /FlateDecode\n"
569  ">>\n"
570  "stream\n",
571  (unsigned long)len);
572  if (n >= sizeof(buf)) {
573  lept_free(comp);
574  return false;
575  }
576  AppendString(buf);
577  long objsize = strlen(buf);
578  AppendData(reinterpret_cast<char *>(comp), len);
579  objsize += len;
580  lept_free(comp);
581  const char *endstream_endobj =
582  "endstream\n"
583  "endobj\n";
584  AppendString(endstream_endobj);
585  objsize += strlen(endstream_endobj);
586  AppendPDFObjectDIY(objsize);
587 
588  const char *stream =
589  "/CIDInit /ProcSet findresource begin\n"
590  "12 dict begin\n"
591  "begincmap\n"
592  "/CIDSystemInfo\n"
593  "<<\n"
594  " /Registry (Adobe)\n"
595  " /Ordering (UCS)\n"
596  " /Supplement 0\n"
597  ">> def\n"
598  "/CMapName /Adobe-Identity-UCS def\n"
599  "/CMapType 2 def\n"
600  "1 begincodespacerange\n"
601  "<0000> <FFFF>\n"
602  "endcodespacerange\n"
603  "1 beginbfrange\n"
604  "<0000> <FFFF> <0000>\n"
605  "endbfrange\n"
606  "endcmap\n"
607  "CMapName currentdict /CMap defineresource pop\n"
608  "end\n"
609  "end\n";
610 
611  // TOUNICODE
612  n = snprintf(buf, sizeof(buf),
613  "6 0 obj\n"
614  "<< /Length %lu >>\n"
615  "stream\n"
616  "%s"
617  "endstream\n"
618  "endobj\n", (unsigned long) strlen(stream), stream);
619  if (n >= sizeof(buf)) return false;
620  AppendPDFObject(buf);
621 
622  // FONT DESCRIPTOR
623  n = snprintf(buf, sizeof(buf),
624  "7 0 obj\n"
625  "<<\n"
626  " /Ascent %d\n"
627  " /CapHeight %d\n"
628  " /Descent -1\n" // Spec says must be negative
629  " /Flags 5\n" // FixedPitch + Symbolic
630  " /FontBBox [ 0 0 %d %d ]\n"
631  " /FontFile2 %ld 0 R\n"
632  " /FontName /GlyphLessFont\n"
633  " /ItalicAngle 0\n"
634  " /StemV 80\n"
635  " /Type /FontDescriptor\n"
636  ">>\n"
637  "endobj\n",
638  1000,
639  1000,
640  1000 / kCharWidth,
641  1000,
642  8L // Font data
643  );
644  if (n >= sizeof(buf)) return false;
645  AppendPDFObject(buf);
646 
647  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
648  if (n >= sizeof(buf)) return false;
649  FILE *fp = fopen(buf, "rb");
650  if (!fp) {
651  tprintf("Can not open file \"%s\"!\n", buf);
652  return false;
653  }
654  fseek(fp, 0, SEEK_END);
655  long int size = ftell(fp);
656  fseek(fp, 0, SEEK_SET);
657  char *buffer = new char[size];
658  if (fread(buffer, 1, size, fp) != size) {
659  fclose(fp);
660  delete[] buffer;
661  return false;
662  }
663  fclose(fp);
664  // FONTFILE2
665  n = snprintf(buf, sizeof(buf),
666  "8 0 obj\n"
667  "<<\n"
668  " /Length %ld\n"
669  " /Length1 %ld\n"
670  ">>\n"
671  "stream\n", size, size);
672  if (n >= sizeof(buf)) {
673  delete[] buffer;
674  return false;
675  }
676  AppendString(buf);
677  objsize = strlen(buf);
678  AppendData(buffer, size);
679  delete[] buffer;
680  objsize += size;
681  AppendString(endstream_endobj);
682  objsize += strlen(endstream_endobj);
683  AppendPDFObjectDIY(objsize);
684  return true;
685 }
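// The truncation check repeated throughout the handler above relies on
// snprintf returning the length the full output would have had. A
// minimal sketch of the pattern, using a hypothetical helper name
// (FormatPagesRef): storing the result in a size_t also turns a
// negative error return into a very large value, so the same
// comparison rejects it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Returns true only if the whole formatted text fit into buf.
bool FormatPagesRef(char* buf, size_t buf_size, long pages_obj) {
  size_t n = static_cast<size_t>(
      std::snprintf(buf, buf_size, " /Pages %ld 0 R\n", pages_obj));
  return n < buf_size;
}
```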
686 
687 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
688  char *filename,
689  long int objnum,
690  char **pdf_object,
691  long int *pdf_object_size) {
692  size_t n;
693  char b0[kBasicBufSize];
694  char b1[kBasicBufSize];
695  char b2[kBasicBufSize];
696  if (!pdf_object_size || !pdf_object)
697  return false;
698  *pdf_object = NULL;
699  *pdf_object_size = 0;
700  if (!filename)
701  return false;
702 
703  L_COMP_DATA *cid = NULL;
704  const int kJpegQuality = 85;
705 
706  // TODO(jbreiden) Leptonica 1.71 doesn't correctly handle certain
707  // types of PNG files, especially if there are 2 samples per pixel.
708  // We can get rid of this logic after Leptonica 1.72 is released and
709  // has propagated everywhere. Bug discussion as follows.
710  // https://code.google.com/p/tesseract-ocr/issues/detail?id=1300
711  int format, sad;
712  findFileFormat(filename, &format);
713  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
714  Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
715  sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
716  pixDestroy(&p1);
717  } else {
718  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
719  }
720 
721  if (sad || !cid) {
722  l_CIDataDestroy(&cid);
723  return false;
724  }
725 
726  const char *group4 = "";
727  const char *filter;
728  switch(cid->type) {
729  case L_FLATE_ENCODE:
730  filter = "/FlateDecode";
731  break;
732  case L_JPEG_ENCODE:
733  filter = "/DCTDecode";
734  break;
735  case L_G4_ENCODE:
736  filter = "/CCITTFaxDecode";
737  group4 = " /K -1\n";
738  break;
739  case L_JP2K_ENCODE:
740  filter = "/JPXDecode";
741  break;
742  default:
743  l_CIDataDestroy(&cid);
744  return false;
745  }
746 
747  // Maybe someday we will accept RGBA but today is not that day.
748  // It requires creating an /SMask for the alpha channel.
749  // http://stackoverflow.com/questions/14220221
750  const char *colorspace;
751  if (cid->ncolors > 0) {
752  n = snprintf(b0, sizeof(b0),
753  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
754  cid->ncolors - 1, cid->cmapdatahex);
755  if (n >= sizeof(b0)) {
756  l_CIDataDestroy(&cid);
757  return false;
758  }
759  colorspace = b0;
760  } else {
761  switch (cid->spp) {
762  case 1:
763  colorspace = " /ColorSpace /DeviceGray\n";
764  break;
765  case 3:
766  colorspace = " /ColorSpace /DeviceRGB\n";
767  break;
768  default:
769  l_CIDataDestroy(&cid);
770  return false;
771  }
772  }
773 
774  int predictor = (cid->predictor) ? 14 : 1;
775 
776  // IMAGE
777  n = snprintf(b1, sizeof(b1),
778  "%ld 0 obj\n"
779  "<<\n"
780  " /Length %lu\n"
781  " /Subtype /Image\n",
782  objnum, (unsigned long) cid->nbytescomp);
783  if (n >= sizeof(b1)) {
784  l_CIDataDestroy(&cid);
785  return false;
786  }
787 
788  n = snprintf(b2, sizeof(b2),
789  " /Width %d\n"
790  " /Height %d\n"
791  " /BitsPerComponent %d\n"
792  " /Filter %s\n"
793  " /DecodeParms\n"
794  " <<\n"
795  " /Predictor %d\n"
796  " /Colors %d\n"
797  "%s"
798  " /Columns %d\n"
799  " /BitsPerComponent %d\n"
800  " >>\n"
801  ">>\n"
802  "stream\n",
803  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
804  group4, cid->w, cid->bps);
805  if (n >= sizeof(b2)) {
806  l_CIDataDestroy(&cid);
807  return false;
808  }
809 
810  const char *b3 =
811  "endstream\n"
812  "endobj\n";
813 
814  size_t b1_len = strlen(b1);
815  size_t b2_len = strlen(b2);
816  size_t b3_len = strlen(b3);
817  size_t colorspace_len = strlen(colorspace);
818 
819  *pdf_object_size =
820  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
821  *pdf_object = new char[*pdf_object_size];
822 
823  char *p = *pdf_object;
824  memcpy(p, b1, b1_len);
825  p += b1_len;
826  memcpy(p, colorspace, colorspace_len);
827  p += colorspace_len;
828  memcpy(p, b2, b2_len);
829  p += b2_len;
830  memcpy(p, cid->datacomp, cid->nbytescomp);
831  p += cid->nbytescomp;
832  memcpy(p, b3, b3_len);
833  l_CIDataDestroy(&cid);
834  return true;
835 }
836 
837 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
838  size_t n;
839  char buf[kBasicBufSize];
840  Pix *pix = api->GetInputImage();
841  char *filename = (char *)api->GetInputName();
842  int ppi = api->GetSourceYResolution();
843  if (!pix || ppi <= 0)
844  return false;
845  double width = pixGetWidth(pix) * 72.0 / ppi;
846  double height = pixGetHeight(pix) * 72.0 / ppi;
847 
848  // PAGE
849  n = snprintf(buf, sizeof(buf),
850  "%ld 0 obj\n"
851  "<<\n"
852  " /Type /Page\n"
853  " /Parent %ld 0 R\n"
854  " /MediaBox [0 0 %.2f %.2f]\n"
855  " /Contents %ld 0 R\n"
856  " /Resources\n"
857  " <<\n"
858  " /XObject << /Im1 %ld 0 R >>\n"
859  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
860  " /Font << /f-0-0 %ld 0 R >>\n"
861  " >>\n"
862  ">>\n"
863  "endobj\n",
864  obj_,
865  2L, // Pages object
866  width,
867  height,
868  obj_ + 1, // Contents object
869  obj_ + 2, // Image object
870  3L); // Type0 Font
871  if (n >= sizeof(buf)) return false;
872  pages_.push_back(obj_);
873  AppendPDFObject(buf);
874 
875  // CONTENTS
876  char* pdftext = GetPDFTextObjects(api, width, height);
877  long pdftext_len = strlen(pdftext);
878  unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
879  size_t len;
880  unsigned char *comp_pdftext =
881  zlibCompress(pdftext_casted, pdftext_len, &len);
882  long comp_pdftext_len = len;
883  n = snprintf(buf, sizeof(buf),
884  "%ld 0 obj\n"
885  "<<\n"
886  " /Length %ld /Filter /FlateDecode\n"
887  ">>\n"
888  "stream\n", obj_, comp_pdftext_len);
889  if (n >= sizeof(buf)) {
890  delete[] pdftext;
891  lept_free(comp_pdftext);
892  return false;
893  }
894  AppendString(buf);
895  long objsize = strlen(buf);
896  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
897  objsize += comp_pdftext_len;
898  lept_free(comp_pdftext);
899  delete[] pdftext;
900  const char *b2 =
901  "endstream\n"
902  "endobj\n";
903  AppendString(b2);
904  objsize += strlen(b2);
905  AppendPDFObjectDIY(objsize);
906 
907  char *pdf_object;
908  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
909  return false;
910  }
911  AppendData(pdf_object, objsize);
912  AppendPDFObjectDIY(objsize);
913  delete[] pdf_object;
914  return true;
915 }
916 
917 
918 bool TessPDFRenderer::EndDocumentHandler() {
919  size_t n;
920  char buf[kBasicBufSize];
921 
922  // We reserved the /Pages object number early, so that the /Page
923  // objects could refer to their parent. We finally have enough
924  // information to go fill it in. Using lower level calls to manipulate
925  // the offset record in two spots, because we are placing objects
926  // out of order in the file.
927 
928  // PAGES
929  const long int kPagesObjectNumber = 2;
930  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
931  n = snprintf(buf, sizeof(buf),
932  "%ld 0 obj\n"
933  "<<\n"
934  " /Type /Pages\n"
935  " /Kids [ ", kPagesObjectNumber);
936  if (n >= sizeof(buf)) return false;
937  AppendString(buf);
938  size_t pages_objsize = strlen(buf);
939  for (size_t i = 0; i < pages_.size(); i++) {
940  n = snprintf(buf, sizeof(buf),
941  "%ld 0 R ", pages_[i]);
942  if (n >= sizeof(buf)) return false;
943  AppendString(buf);
944  pages_objsize += strlen(buf);
945  }
946  n = snprintf(buf, sizeof(buf),
947  "]\n"
948  " /Count %d\n"
949  ">>\n"
950  "endobj\n", pages_.size());
951  if (n >= sizeof(buf)) return false;
952  AppendString(buf);
953  pages_objsize += strlen(buf);
954  offsets_.back() += pages_objsize; // manipulation #2
955 
956  // INFO
957  char* datestr = l_getFormattedDate();
958  n = snprintf(buf, sizeof(buf),
959  "%ld 0 obj\n"
960  "<<\n"
961  " /Producer (Tesseract %s)\n"
962  " /CreationDate (D:%s)\n"
963  " /Title (%s)\n"
964  ">>\n"
965  "endobj\n", obj_, TESSERACT_VERSION_STR, datestr, title());
966  lept_free(datestr);
967  if (n >= sizeof(buf)) return false;
968  AppendPDFObject(buf);
969  n = snprintf(buf, sizeof(buf),
970  "xref\n"
971  "0 %ld\n"
972  "0000000000 65535 f \n", obj_);
973  if (n >= sizeof(buf)) return false;
974  AppendString(buf);
975  for (int i = 1; i < obj_; i++) {
976  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
977  if (n >= sizeof(buf)) return false;
978  AppendString(buf);
979  }
980  n = snprintf(buf, sizeof(buf),
981  "trailer\n"
982  "<<\n"
983  " /Size %ld\n"
984  " /Root %ld 0 R\n"
985  " /Info %ld 0 R\n"
986  ">>\n"
987  "startxref\n"
988  "%ld\n"
989  "%%%%EOF\n",
990  obj_,
991  1L, // catalog
992  obj_ - 1, // info
993  offsets_.back());
994  if (n >= sizeof(buf)) return false;
995  AppendString(buf);
996  return true;
997 }
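// A sketch of the cross-reference table layout written above. Each
// in-use entry is exactly 20 bytes: a 10-digit offset, a 5-digit
// generation number, the letter n, and a two-byte line ending (here a
// space plus newline). XrefEntry and XrefTable are hypothetical names;
// offsets[i] is assumed to hold the start of object i, with slot 0
// standing for the free entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

std::string XrefEntry(long offset) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offset);
  return buf;
}

std::string XrefTable(const std::vector<long>& offsets) {
  char head[64];
  std::snprintf(head, sizeof(head), "xref\n0 %ld\n0000000000 65535 f \n",
                static_cast<long>(offsets.size()));
  std::string table = head;
  for (size_t i = 1; i < offsets.size(); ++i) {
    table += XrefEntry(offsets[i]);
  }
  return table;
}
```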
998 } // namespace tesseract