forked from py-pdf/pypdf
Fix this off-by-one error #1
Open
speedplane wants to merge 167 commits into master from feature/ASCII85-Off-By-One
Adapted from work by Sylvain Pelissier (@sylvainpelissier): http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
The script works, but only with a limited range of image types. Future commits will add sample PDFs and notes about what works and what fails.
```
> python pdf-image-extractor.py ..\PDF_Samples\GeoBase_NHNC1_Data_Model_UML_EN.pdf
Traceback (most recent call last):
File "pdf-image-extractor.py", line 33, in <module>
img = Image.frombytes(mode, size, data)
File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 2047, in frombytes
im.frombytes(data, decoder_name, args)
File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 731, in frombytes
raise ValueError("not enough image data")
ValueError: not enough image data
```
Source:
http://ftp2.cits.rncan.gc.ca/pub/geobase/official/nhn_rhn/doc/
"""
All distributed data are subject to the Open Government Licence – Canada.
Canada grants to the licensee a non-exclusive, fully paid, royalty-free
right and licence to exercise all intellectual property rights in the
data. This includes the right to use, incorporate, sublicense (with
further right of sublicensing), modify, improve, further develop, and
distribute the Data; and to manufacture or distribute derivative
products.
-- http://www.nrcan.gc.ca/earth-sciences/geography/topographic-information/free-data-geogratis/licence/17285
"""
Image extractor script with sample failing pdf
Travis CI picture.
…EIQFeature-1 Fix a bug in _readInlineImage
Uses same structure as addLink addURI
…nd_paeth_filter Add support for PNG filters average and paeth
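For readers unfamiliar with the two PNG filter types named in this commit, here is a minimal sketch of how the Average (type 3) and Paeth (type 4) filters are reversed, following the PNG specification. The function names `paeth_predictor` and `undo_filter` and the `bpp` (bytes per pixel) parameter are illustrative assumptions, not the code from this commit.

```python
def paeth_predictor(a: int, b: int, c: int) -> int:
    """Pick whichever of left (a), up (b), upper-left (c) is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    # Ties are broken in the order a, b, c, as the PNG specification requires.
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_filter(filter_type: int, row: bytes, prev_row: bytes, bpp: int) -> bytes:
    """Reverse an Average (3) or Paeth (4) filter for one scanline.

    Hypothetical helper: `prev_row` is the already reconstructed previous
    scanline (all zeros for the first row), `bpp` is bytes per pixel.
    """
    out = bytearray(len(row))
    for i, x in enumerate(row):
        a = out[i - bpp] if i >= bpp else 0       # reconstructed byte to the left
        b = prev_row[i]                           # byte above
        c = prev_row[i - bpp] if i >= bpp else 0  # byte above and to the left
        if filter_type == 3:
            pred = (a + b) // 2
        elif filter_type == 4:
            pred = paeth_predictor(a, b, c)
        else:
            raise ValueError("this sketch only handles filter types 3 and 4")
        out[i] = (x + pred) & 0xFF
    return bytes(out)
```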
…into JohnMulligan-URI-linking
Prevent infinite loop in readObject() function
Changes readStringFromStream to use a dict of escapes rather than a long if/else chain. (should lead to speed up, and looks cleaner)
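The idea of replacing an if/else chain with a lookup table can be sketched roughly as follows. This is not the actual `readStringFromStream` code: `ESCAPES` and `unescape` are hypothetical names, and the sketch deliberately leaves out octal escapes and backslash line continuations.

```python
# Hypothetical lookup table: escape character -> decoded byte.
ESCAPES = {
    b"n": b"\n",
    b"r": b"\r",
    b"t": b"\t",
    b"b": b"\b",
    b"f": b"\f",
    b"(": b"(",
    b")": b")",
    b"\\": b"\\",
}


def unescape(raw: bytes) -> bytes:
    """Decode simple backslash escapes in the body of a PDF literal string."""
    out = bytearray()
    i = 0
    while i < len(raw):
        ch = raw[i:i + 1]
        if ch == b"\\" and i + 1 < len(raw):
            nxt = raw[i + 1:i + 2]
            # One dict lookup replaces the chain of comparisons; an unknown
            # escape keeps the character and drops the backslash.
            out += ESCAPES.get(nxt, nxt)
            i += 2
        else:
            out += ch
            i += 1
    return bytes(out)
```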
The previous check always evaluated to False on Python 3, so I replaced it with a duck-typing check that is compatible with both Python versions.
* Explicitly export PdfFileReader, PdfFileWriter
* Implicit string concatenation
* Don't leave open file handles
* Apply hints from flake8-simplify
* Only import stuff that is used
Signed-off-by: Matthew Peveler <matt.peveler@gmail.com>
Signed-off-by: Matthew Peveler <matt.peveler@gmail.com>
* Replace pytest-cov by coverage
* Fix coverage badge
Adding unit Tests:
* xmp
* ConvertFunctionsToVirtualList
* PyPDF2.utils.hexStr
* Page operations with encoded file
* merging encrypted
* images
DOC: Comments to docstrings
STY: Remove vim comments
BUG: CCITTFaxDecode decodeParms can be an ArrayObject.
I don't know what a good solution would look like. It no longer throws an error, but the result might be wrong.
BUG: struct was not imported for Python 2.X
Credits to Sebastian Krause for creating the PDF: py-pdf#331 (comment)
Co-authored-by: Sebastian Krause <sebastian@realpath.org>
Closes py-pdf#329 - potential infinite loop (SEC)
Closes py-pdf#330 - performance issue of ContentStream._readInlineImage (PERF)
Security (SEC):
- ContentStream_readInlineImage had potential infinite loop (py-pdf#740)

Bug fixes (BUG):
- Fix merging encrypted files (py-pdf#757)
- CCITTFaxDecode decodeParms can be an ArrayObject (py-pdf#756)

Robustness improvements (ROBUST):
- title sometimes None (py-pdf#744)

Documentation (DOC):
- Adjust short description of the package

Tests and Test setup (TST):
- Rewrite JS tests from unittest to pytest (py-pdf#746)
- Increase Test coverage, mainly with filters (py-pdf#756)
- Add test for inline images (py-pdf#758)

Developer Experience Improvements (DEV):
- Remove unused Travis-CI configuration (py-pdf#747)
- Show code coverage (py-pdf#754, py-pdf#755)
- Add mutmut (py-pdf#760)

Miscellaneous:
- STY: Closing file handles, explicit exports, ... (py-pdf#743)

All changes: py-pdf/pypdf@1.27.4...1.27.5
ISSUE: The problem appears because the _flatten() method sets self.flattenedPages before it tries to get the pages, and doesn't reset it to None if an error occurs. This PR makes _flatten() set self.flattenedPages to an empty array only after it has successfully retrieved the pages.
FIX: Set `self.flattenedPages` after calling `catalog["/Pages"].getObject()`.
Closes py-pdf#327
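The ordering issue described here can be illustrated with a small stand-in class. This is a sketch under assumptions, not the reader's real code: `ReaderSketch`, `load_pages`, and `num_pages` are hypothetical, with `load_pages` playing the role of `catalog["/Pages"].getObject()`.

```python
class ReaderSketch:
    """Hypothetical stand-in for a PDF reader that caches its page list."""

    def __init__(self, load_pages):
        # `load_pages` is an assumed callable that may raise on a broken file.
        self._load_pages = load_pages
        self.flattened_pages = None  # None means "not flattened yet"

    def _flatten(self):
        # Fetch first: if this raises, the cache stays None and a later call
        # retries instead of silently reporting zero pages.
        pages = self._load_pages()
        # Only initialise the cache once the lookup has succeeded.
        self.flattened_pages = []
        self.flattened_pages.extend(pages)

    def num_pages(self):
        if self.flattened_pages is None:
            self._flatten()
        return len(self.flattened_pages)
```

With the assignment placed before the lookup, a failed first call would leave an empty list behind, and every later call would report zero pages rather than raising again.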
Credits to Denis Osipov: py-pdf#359 (comment)
Co-authored-by: Denis Osipov <osipov_d@list.ru>
The header being read has the format `<idnum> <generation> obj`, where `<idnum>` and `<generation>` are integers. Previously, an arbitrary number of spaces was allowed between `<idnum>` and `<generation>`, but not between `<generation>` and `obj`. We now allow arbitrary spaces between `<generation>` and `obj` as well.
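A tolerant parse of that header could look like the sketch below. It uses a regular expression instead of the reader's stream-based parsing, and `HEADER_RE` and `parse_object_header` are hypothetical names chosen for illustration.

```python
import re

# Whitespace runs are allowed between all three tokens of the header.
HEADER_RE = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj")


def parse_object_header(data: bytes):
    """Return (idnum, generation) parsed from the start of `data`."""
    match = HEADER_RE.match(data)
    if match is None:
        raise ValueError("not an indirect object header")
    return int(match.group(1)), int(match.group(2))
```

Both `b"12 0 obj"` and `b"12 0    obj"` parse to `(12, 0)` with this pattern.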
This allows us to leverage the IDE.
* Documentation: We can now document what the constants are good for and give background information around them
* Homographs: We can distinguish literals which have the same name, but different contexts
* Typos: We can hopefully avoid typos like decodeParams -> decodeParms.

For users of PyPDF2, this doesn't change anything. We still use string literals. For documentation we should also keep doing that.
This helps users who run into issue py-pdf#67
Fixes bug where decodeParms.get(...) causes `AttributeError: 'ArrayObject' object has no attribute 'get'`
Closes py-pdf#404
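One way to avoid that AttributeError is to accept both shapes of the parameter before looking a key up. The sketch below uses a plain `dict` and `list` in place of the library's dictionary and array objects; `get_decode_parm` is a hypothetical helper, not the fix applied in this commit.

```python
def get_decode_parm(decode_parms, key, default=None):
    """Look up `key` whether decode_parms is a dict, a list of dicts, or None."""
    if decode_parms is None:
        return default
    if isinstance(decode_parms, dict):
        return decode_parms.get(key, default)
    # Array form: search each dictionary entry and skip anything else
    # (for example null entries).
    for entry in decode_parms:
        if isinstance(entry, dict) and key in entry:
            return entry[key]
    return default
```

Taking the first matching entry is an assumption of this sketch; when a stream has several filters, each array entry belongs to the filter at the same position.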
Added an optional parameter to readNextEndLine() to limit the offset; read() then uses this parameter to limit reading to `last1K`.
Closes py-pdf#639
Closes py-pdf#439
ASCII85 can represent values up to and including 2^32-1. The off-by-one error was causing validations to break.
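The bound can be checked with a short sketch. It assumes the off-by-one sat in a range check on the decoded 32-bit value; `MAX_UINT32` and `decode_group` are hypothetical names, not the code changed in this PR.

```python
import base64

MAX_UINT32 = 2**32 - 1


def decode_group(group: bytes) -> int:
    """Decode one five-character ASCII85 group into an unsigned 32-bit value."""
    if len(group) != 5:
        raise ValueError("an ASCII85 group has exactly five characters")
    value = 0
    for ch in group:
        digit = ch - 33  # '!' is digit 0, 'u' is digit 84
        if not 0 <= digit < 85:
            raise ValueError("character outside the ASCII85 alphabet")
        value = value * 85 + digit
    # Inclusive bound: 2**32 - 1 itself is a legal value. A strict
    # `value < MAX_UINT32` check would wrongly reject it.
    if value > MAX_UINT32:
        raise ValueError("group decodes to more than 32 bits")
    return value


# b"s8W-!" is the ASCII85 encoding of four 0xFF bytes, the largest valid group.
assert decode_group(b"s8W-!") == MAX_UINT32
assert base64.a85decode(b"s8W-!") == b"\xff\xff\xff\xff"
```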