@speedplane
Owner

ASCII85 can represent 2^32-1. Treating that value as out of range was an error that caused validations to break.
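
To illustrate the bound, here is a minimal sketch of decoding one five-character ASCII85 group. It is not the PyPDF2 code; `a85_group_value` is a hypothetical helper name. The point it shows is that the upper bound must be inclusive.

```python
def a85_group_value(group: bytes) -> int:
    """Return the 32-bit value encoded by one five-character ASCII85 group.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    if len(group) != 5:
        raise ValueError("expected exactly five characters")
    value = 0
    for ch in group:
        digit = ch - 33  # '!' (33) is digit 0, 'u' (117) is digit 84
        if not 0 <= digit <= 84:
            raise ValueError("character outside the ASCII85 alphabet")
        value = value * 85 + digit
    # The bound is inclusive: 2**32 - 1 itself is a legal value.
    if value > 2**32 - 1:
        raise ValueError("group overflows 32 bits")
    return value
```

The group `s8W-!` is the largest legal one: it decodes to 2^32-1 (bytes `FF FF FF FF`), so a strict `<` comparison against 2^32-1 would reject valid input.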

Henri Salo and others added 30 commits August 18, 2015 13:42
Adapted from work by Sylvain Pelissier (@sylvainpelissier)
http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python

The script works, but only with a limited range of image types.
Future commits will have sample PDFs and notes about what works/fails.
```
> python pdf-image-extractor.py ..\PDF_Samples\GeoBase_NHNC1_Data_Model_UML_EN.pdf
Traceback (most recent call last):
  File "pdf-image-extractor.py", line 33, in <module>
    img = Image.frombytes(mode, size, data)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 2047, in frombytes
    im.frombytes(data, decoder_name, args)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 731, in frombytes
    raise ValueError("not enough image data")
ValueError: not enough image data
```

Source:
http://ftp2.cits.rncan.gc.ca/pub/geobase/official/nhn_rhn/doc/

"""
All distributed data are subject to the Open Government Licence – Canada.

Canada grants to the licensee a non-exclusive, fully paid, royalty-free
right and licence to exercise all intellectual property rights in the
data. This includes the right to use, incorporate, sublicense (with
further right of sublicensing), modify, improve, further develop, and
distribute the Data; and to manufacture or distribute derivative
products.

-- http://www.nrcan.gc.ca/earth-sciences/geography/topographic-information/free-data-geogratis/licence/17285
"""
Image extractor script with sample failing pdf
Travis CI picture.
…EIQFeature-1

Fix a bug in _readInlineImage
Uses same structure as addLink
addURI
…nd_paeth_filter

Add support for PNG filters average and paeth
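
For reference, a minimal sketch of how the two PNG filter types can be reversed, following the definitions in the PNG specification (filter type 3 is Average, type 4 is Paeth). This is not the code from the commit; `paeth_predictor` and `unfilter_row` are hypothetical names, and `bpp` is assumed to be the number of bytes per complete pixel.

```python
def paeth_predictor(left: int, up: int, up_left: int) -> int:
    """Pick whichever neighbour is closest to left + up - up_left."""
    p = left + up - up_left
    pa = abs(p - left)
    pb = abs(p - up)
    pc = abs(p - up_left)
    # Ties are broken in the order left, up, up_left.
    if pa <= pb and pa <= pc:
        return left
    if pb <= pc:
        return up
    return up_left


def unfilter_row(filter_type: int, row: bytes, prev: bytes, bpp: int) -> bytes:
    """Reverse the Average (3) or Paeth (4) filter for one scanline.

    `prev` is the already-reconstructed previous scanline (all zeros for
    the first row) and must be as long as `row`.
    """
    out = bytearray(len(row))
    for i, x in enumerate(row):
        left = out[i - bpp] if i >= bpp else 0
        up = prev[i]
        up_left = prev[i - bpp] if i >= bpp else 0
        if filter_type == 3:
            predicted = (left + up) // 2
        elif filter_type == 4:
            predicted = paeth_predictor(left, up, up_left)
        else:
            raise ValueError("this sketch only handles filter types 3 and 4")
        out[i] = (x + predicted) & 0xFF  # arithmetic is modulo 256
    return bytes(out)
```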
Prevent infinite loop in readObject() function
Change readStringFromStream to use a dict of escapes rather than a long if/else chain (should be faster and looks cleaner).
The previous check always evaluated to False on Python 3, so I replaced it
with a duck-typing one compatible with both Python versions.
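
A rough sketch of the dict-based approach, assuming the standard single-character escapes of PDF literal strings. The table and the `unescape_literal` helper are hypothetical, not the actual readStringFromStream code, and octal escapes (`\ddd`) and line continuations are deliberately left out.

```python
# Hypothetical escape table: maps the byte after a backslash to its meaning.
ESCAPES = {
    b"n": b"\n",
    b"r": b"\r",
    b"t": b"\t",
    b"b": b"\b",
    b"f": b"\f",
    b"(": b"(",
    b")": b")",
    b"\\": b"\\",
}


def unescape_literal(body: bytes) -> bytes:
    """Resolve simple backslash escapes in the body of a literal string."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i:i + 1]
        if ch == b"\\" and i + 1 < len(body):
            nxt = body[i + 1:i + 2]
            # One dict lookup replaces the if/else chain; for an unknown
            # escape the backslash is dropped and the character kept.
            out += ESCAPES.get(nxt, nxt)
            i += 2
        else:
            out += ch
            i += 1
    return bytes(out)
```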
MartinThoma and others added 30 commits April 12, 2022 22:35
* Explicitly export PdfFileReader, PdfFileWriter
* Implicit string concatenation
* Don't leave open file handles
* Apply hints from flake8-simplify
* Only import stuff that is used
Signed-off-by: Matthew Peveler <matt.peveler@gmail.com>
* Replace pytest-cov by coverage
* Fix coverage badge
Adding unit Tests:

* xmp
* ConvertFunctionsToVirtualList
* PyPDF2.utils.hexStr
* Page operations with encoded file
* merging encrypted
* images

DOC: Comments to docstrings
STY: Remove vim comments

BUG: CCITTFaxDecode decodeParms can be an ArrayObject.
          I don't know what a good solution would look like. For now it doesn't throw an error, but the result might be wrong.
BUG: struct was not imported for Python 2.X
Credits to Sebastian Krause for creating the PDF:
py-pdf#331 (comment)

Co-authored-by: Sebastian Krause <sebastian@realpath.org>
Closes py-pdf#329 - potential infinite loop (SEC)
Closes py-pdf#330 - performance issue of ContentStream._readInlineImage (PERF)
Security (SEC):

- ContentStream._readInlineImage had potential infinite loop (py-pdf#740)

Bug fixes (BUG):

- Fix merging encrypted files (py-pdf#757)
- CCITTFaxDecode decodeParms can be an ArrayObject (py-pdf#756)

Robustness improvements (ROBUST):

- title sometimes None (py-pdf#744)

Documentation (DOC):

- Adjust short description of the package

Tests and Test setup (TST):

- Rewrite JS tests from unittest to pytest (py-pdf#746)
- Increase Test coverage, mainly with filters (py-pdf#756)
- Add test for inline images (py-pdf#758)

Developer Experience Improvements (DEV):

- Remove unused Travis-CI configuration (py-pdf#747)
- Show code coverage (py-pdf#754, py-pdf#755)
- Add mutmut (py-pdf#760)

Miscellaneous:

- STY: Closing file handles, explicit exports, ... (py-pdf#743)

All changes: py-pdf/pypdf@1.27.4...1.27.5
ISSUE: The problem appears because the _flatten() method sets self.flattenedPages before it tries to get the pages and doesn't set it back to None if an error occurs. This PR just makes _flatten() set self.flattenedPages to an empty array after it has successfully got the pages.

FIX: Set `self.flattenedPages` after calling `catalog["/Pages"].getObject()`

Closes py-pdf#327
Credits to Denis Osipov:
py-pdf#359 (comment)

Co-authored-by: Denis Osipov <osipov_d@list.ru>
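
The ordering issue in _flatten() described above can be shown with a toy class. This is a simplified sketch, not the PyPDF2 reader: `ToyReader`, its two flatten variants, and `state_after_failure` are hypothetical, and a plain dict stands in for the catalog.

```python
class ToyReader:
    """Toy stand-in for a reader that caches its flattened page list."""

    def __init__(self, catalog: dict):
        self.catalog = catalog
        self.flattened_pages = None  # None means "not flattened yet"

    def _flatten_buggy(self) -> None:
        self.flattened_pages = []       # cache is marked as built too early
        pages = self.catalog["/Pages"]  # may raise, leaving a stale []
        self.flattened_pages.extend(pages)

    def _flatten_fixed(self) -> None:
        pages = self.catalog["/Pages"]  # may raise; cache is still None
        self.flattened_pages = []
        self.flattened_pages.extend(pages)


def state_after_failure(flatten_name: str):
    """Return the cache state after a flatten call that fails."""
    reader = ToyReader({})  # no "/Pages" key, so the lookup raises KeyError
    try:
        getattr(reader, flatten_name)()
    except KeyError:
        pass
    return reader.flattened_pages
```

With the buggy ordering the cache is left as an empty list after the failure, so later callers see a document with zero pages instead of the original error. With the fixed ordering it stays `None`.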

The header being read has the format:

    <idnum> <generation> obj

where `<idnum>` and `<generation>` are integers.
Previously an arbitrary number of spaces was allowed between `<idnum>` and `<generation>`, but not between `<generation>` and `obj`.
We now allow arbitrary spaces between `<generation>` and `obj`.
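
As an illustration of the relaxed header format, here is a sketch that uses a regular expression. That is a swapped-in technique for brevity, not how the reader parses the stream. `HEADER_RE` and `parse_object_header` are hypothetical names, and "spaces" is taken to mean any whitespace matched by `\s`.

```python
import re

# One or more whitespace characters are accepted in both gaps.
HEADER_RE = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj")


def parse_object_header(data: bytes) -> tuple:
    """Return (idnum, generation) from the start of an object header."""
    match = HEADER_RE.match(data)
    if match is None:
        raise ValueError("not an object header")
    return int(match.group(1)), int(match.group(2))
```

Both `12 0 obj` and `12 0    obj` parse to `(12, 0)` with this pattern.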
This allows us to leverage the IDE.

* Documentation: We can now document what the constants are good for and give background information around them
* Homographs: We can distinguish literals which have the same name, but different contexts
* Typos: We can hopefully avoid typos like decodeParams -> decodeParms.

For users of PyPDF2, this doesn't change anything. We still use string literals. For documentation we should also keep doing that.
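
A small sketch of what such a constants class could look like. The class name `StreamKeys` and the helper `decode_parms_of` are hypothetical and only illustrate the idea; the values are ordinary strings, so code that uses string literals keeps working.

```python
class StreamKeys:
    """Keys of a PDF stream dictionary (hypothetical class name)."""

    FILTER = "/Filter"
    # Spelled "Parms", not "Params", in the PDF specification.
    DECODE_PARMS = "/DecodeParms"


def decode_parms_of(stream_dict: dict):
    """Look up the decode parameters via the constant, not a literal."""
    return stream_dict.get(StreamKeys.DECODE_PARMS)
```

A misspelled attribute such as `StreamKeys.DECODE_PARAMS` fails loudly with an AttributeError and is flagged by the IDE, whereas a misspelled literal `"/DecodeParams"` would just silently miss the key.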
Fixes bug where decodeParms.get(...) causes
AttributeError: 'ArrayObject' object has no attribute 'get'

Closes py-pdf#404
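
One way to avoid the AttributeError is to normalise the value before looking anything up. The sketch below is an assumption about a possible approach, not the fix that was merged: `get_decode_parm` is a hypothetical helper, plain `dict` and `list` stand in for DictionaryObject and ArrayObject, and taking the first matching entry of an array is an arbitrary policy chosen for illustration.

```python
def get_decode_parm(decode_parms, key, default=None):
    """Fetch `key` whether decode_parms is a dict, a list of dicts, or None."""
    if decode_parms is None:
        return default
    if isinstance(decode_parms, (list, tuple)):
        # Array form: scan the entries and use the first dict that has the key.
        for entry in decode_parms:
            if isinstance(entry, dict) and key in entry:
                return entry[key]
        return default
    return decode_parms.get(key, default)
```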
Added an optional parameter to readNextEndLine() to limit the offset;
read() then uses this parameter to limit the reading to the last 1K

Closes py-pdf#639
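
A minimal sketch of a backwards line reader with an offset limit, to show the idea. It is not the readNextEndLine() implementation: `read_line_backwards` and `limit_offset` are hypothetical names, and raising `ValueError` once the limit is reached is an assumption about the desired behaviour.

```python
import io


def read_line_backwards(stream, limit_offset: int = 0) -> bytes:
    """Read the line that ends before the current position, going backwards.

    Stops with ValueError instead of scanning past `limit_offset`.
    """
    line = bytearray()
    while True:
        pos = stream.tell()
        if pos <= limit_offset:
            raise ValueError("reached the limit offset before the line start")
        stream.seek(pos - 1)
        ch = stream.read(1)
        stream.seek(pos - 1)  # step back over the byte just read
        if ch in b"\r\n":
            if line:
                break  # end-of-line marker that precedes the collected line
            continue  # still skipping trailing end-of-line bytes
        line.insert(0, ch[0])
    return bytes(line)
```

A caller that only wants to search the tail of a file could pass something like `max(0, file_size - 1024)` as `limit_offset`.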
Closes py-pdf#439