tools-jpeg2pdf
The primary source code repository for these tools is https://github.com/albion2000/tools-jpeg2pdf
The releases are here : https://github.com/albion2000/tools-jpeg2pdf/releases
Tools for bulk conversion of page scans into pdf documents ready for OCR with software such as Adobe Acrobat DC.
They are functional and were tested on Windows 10 with about 14000 pages. Special cases will probably still come up; please report any problems.
One goal is to keep the tools as simple as possible (KISS) and to refrain from feature creep. It might be better to create a new derived tool than to cripple the current one.
Some small adaptations will be necessary to run the tools on Linux, such as replacing ‘\’ with ‘/’ in paths.
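One way to avoid hard-coded separators altogether is sketched below. It assumes the scripts currently build paths by joining strings with a fixed separator, which has not been checked against the source. The helper name `child_path` is hypothetical.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def child_path(root, *parts):
    """Join path components using the separator of the current OS."""
    return Path(root).joinpath(*parts)


# The same components rendered with each separator convention:
print(PureWindowsPath("scans", "book1", "page_0001.jpg"))
print(PurePosixPath("scans", "book1", "page_0001.jpg"))

# On the running platform, no separator needs to be written by hand:
print(child_path("scans", "book1", "page_0001.jpg"))
```

Paths built this way should work unchanged on Windows and Linux.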
In short :
naming_conventions_files.py, naming_conventions_do_rename.py & naming_conventions_do_rename_files.py
Read the installation_instructions.txt and readme_naming_conventions.txt
Copy ‘naming_conventions_files.py’, ‘naming_conventions_do_rename_files.py’ & ‘naming_conventions_do_rename.py’ to the root directory of your file tree.
The primary purpose of this tool is to give a directory tree a longer lifetime by reducing the risk of corruption during transfers between file systems.
It recursively parses sub directories. It can rename directories and files so that they follow some strict conventions.
Script for simulation, with no effect (by default) on the names of the directories and files, for validation purposes:
- naming_conventions_files.py
Script that actually renames directories:
- naming_conventions_do_rename.py
Script that actually renames files:
- naming_conventions_do_rename_files.py
You can also run naming_conventions_files.py on the command line to do everything, with more options.
Syntax for simulation, with no effect on the names of the files or directories, for validation purposes: naming_conventions_files.py
or naming_conventions_files.py -t
Syntax to actually rename the files: naming_conventions_files.py -w
Syntax to actually rename the directories: naming_conventions_files.py -d
Use the additional option -t if you want dates to be moved to the front of the names and rewritten in YYYY-MM-DD format.
All 3 options can be combined, and all combinations are meaningful.
naming_conventions_do_rename is equivalent to a call to naming_conventions_files -d
naming_conventions_do_rename_files is equivalent to a call to naming_conventions_files -w
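The way the three switches combine can be sketched as follows. This is only an illustration of the documented behaviour: whether the real script uses `argparse` is an assumption, and `build_parser` and `describe` are hypothetical names.

```python
import argparse


def build_parser():
    """Parser with the three documented switches; all are off by default."""
    parser = argparse.ArgumentParser(prog="naming_conventions_files.py")
    parser.add_argument("-w", action="store_true", help="actually rename files")
    parser.add_argument("-d", action="store_true", help="actually rename directories")
    parser.add_argument("-t", action="store_true",
                        help="move dates to the front as YYYY-MM-DD")
    return parser


def describe(argv):
    """Return a readable list of what a given command line would do."""
    options = build_parser().parse_args(argv)
    actions = []
    if options.d:
        actions.append("rename directories")
    if options.w:
        actions.append("rename files")
    if not actions:
        # Neither -w nor -d: nothing is renamed, names are only reported.
        actions.append("simulate only")
    if options.t:
        actions.append("move dates to front")
    return actions


print(describe([]))            # plain simulation
print(describe(["-t"]))        # simulation that also previews date handling
print(describe(["-d", "-w"]))  # rename both directories and files
```

Running the simulation first and adding -w or -d only once the report looks right keeps the renaming step safe.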
This tool would typically be run before scandir2pdf.py
Rules followed :
- converts everything to ASCII characters as best it can
- converts to lower case (no longer enforced in releases after v1.2.2)
- anything that is neither a letter nor a number nor a \ ‘-’ ‘(’ ‘)’ ‘#’ is replaced by _
- ae and oe are handled, ‘__’ is reduced to a single underscore, underscores at the start and end are removed
- dates are pushed to the front and rewritten to look like YYYY-MM-DD_
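The rules above can be approximated with the standard library, as in the sketch below. It is not the tool's actual code, and `sanitize` and `date_to_front` are hypothetical names. It makes several assumptions: it works on a single name without its extension (so ‘\’ is not treated specially), "ae, oe handled" is read as mapping the ligatures æ and œ, lower-casing is left out, and dates are assumed to appear in day-month-year order.

```python
import re
import unicodedata

# Assumption: "ae, oe handled" refers to these ligatures.
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"}

# Assumption: dates appear as DD-MM-YYYY with '-', '_' or '.' as separator.
DATE = re.compile(r"(?<!\d)(\d{1,2})[-_.](\d{1,2})[-_.](\d{4})(?!\d)")


def to_ascii(text):
    """Best-effort ASCII conversion: expand ligatures, then drop accents."""
    for source, target in LIGATURES.items():
        text = text.replace(source, target)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def sanitize(name):
    """Apply the character rules to one file or directory name."""
    name = to_ascii(name)
    name = re.sub(r"[^A-Za-z0-9\-()#]", "_", name)  # everything else becomes _
    name = re.sub(r"_+", "_", name)                  # collapse runs of _
    return name.strip("_")                           # no _ at start or end


def date_to_front(name):
    """Move the first DD-MM-YYYY date to the front as YYYY-MM-DD_."""
    match = DATE.search(name)
    if not match:
        return name
    day, month, year = match.groups()
    rest = name[:match.start()] + "_" + name[match.end():]
    rest = re.sub(r"_+", "_", rest).strip("_")
    prefix = f"{year}-{int(month):02d}-{int(day):02d}"
    return f"{prefix}_{rest}" if rest else prefix


print(sanitize("Œuvres complètes (tome 1)"))
print(date_to_front(sanitize("rapport 12-03-2019 final")))
```

The date pattern is the part most likely to differ from the real tool, so compare it with readme_naming_conventions.txt before relying on it.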
check_jpegs.py & check_jpegs_full.py
Read the installation_instructions.txt and readme_check_jpegs.txt
Copy ‘check_jpegs.py’ & ‘check_jpegs_full.py’ to the root directory of your file tree.
It recursively parses sub directories for .jpeg and .jpg files
For a quick check that the files are not fully corrupted. It is deliberately fast, so that bad files are detected rapidly: only the headers are checked and the images are not decompressed. Use it before scandir2pdf.py
- check_jpegs.py
For a full check, much slower (also useful for your family pictures)
- check_jpegs_full.py
Each ‘.’ shows that one more directory was parsed
When a file is reported as corrupted, it does not mean that it is lost. Try to open it in your favorite image software and save it again (using the best quality setting, in order to reduce compression losses). That is often enough. If you cannot open it, try other image tools. Not all of them handle file corruption the same way.
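To illustrate what a header-only check can look like, here is a minimal sketch using only the standard library. It is not the method used by check_jpegs.py, which is not described here. It only looks at the JPEG start and end markers, and `quick_check` and `find_suspect_jpegs` are hypothetical names.

```python
import os
from pathlib import Path

JPEG_START = b"\xff\xd8\xff"  # start-of-image marker plus first byte of the next marker
JPEG_END = b"\xff\xd9"        # end-of-image marker


def quick_check(path):
    """Return 'ok', 'bad header' or 'no end marker' without decoding the image."""
    if os.path.getsize(path) < 4:
        return "bad header"
    with open(path, "rb") as handle:
        head = handle.read(3)
        handle.seek(-2, os.SEEK_END)
        tail = handle.read(2)
    if head != JPEG_START:
        return "bad header"
    if tail != JPEG_END:
        return "no end marker"
    return "ok"


def find_suspect_jpegs(root):
    """Recursively list (path, status) for .jpg/.jpeg files that fail the check."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            status = quick_check(path)
            if status != "ok":
                suspects.append((path, status))
    return suspects
```

Some valid files carry extra bytes after the end marker, so 'no end marker' is a hint that the file may be truncated rather than proof of damage.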
check_pdfs.py
Read the installation_instructions.txt
Copy ‘check_pdfs.py’ to the root directory of your file tree.
It recursively parses sub directories for .pdf files
Each ‘*’ shows that one more file was parsed
Some false positives are possible.
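As an illustration of a cheap structural check, the sketch below looks for the `%PDF-` header near the start of a file and the `%%EOF` marker near its end. It is not the method used by check_pdfs.py, which is not described here. `quick_pdf_check` is a hypothetical name, and the 1024-byte window is an assumption based on how lenient pdf readers commonly behave.

```python
import os

WINDOW = 1024  # assumption: markers are searched within this many bytes of each end


def quick_pdf_check(path):
    """Return 'ok', 'bad header' or 'no end marker' for one pdf file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(WINDOW)
        handle.seek(max(0, size - WINDOW))
        tail = handle.read()
    if b"%PDF-" not in head:
        return "bad header"
    if b"%%EOF" not in tail:
        return "no end marker"
    return "ok"
```

A check this shallow cannot see damage in the middle of a file, and it may flag unusual but readable files.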
scandir2pdf.py
Read the installation_instructions.txt and readme_scandir2pdf.txt for recommendations and use
Copy ‘scandir2pdf.py’ to the root directory of your file tree.
It recursively parses sub directories for .jpeg and .jpg files
For each directory X, the tool scandir2pdf groups the image files into a file named X.pdf and moves it one level up in the file tree.
It is built upon img2pdf, which ensures no jpeg recompression and the best possible quality.
Logs progress, errors and the final report to the console and to the file “logParse.txt”
After scandir2pdf, you would use OCR software of your choice.
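The grouping of images into one X.pdf per directory X can be sketched as follows. This is an illustration under assumptions, not the code of scandir2pdf.py: `plan_pdfs` and `write_pdfs` are hypothetical names, images are taken in sorted name order, and images lying directly in the root directory are ignored. The only third-party call used is `img2pdf.convert`, which accepts a list of file names and returns the pdf as bytes.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def plan_pdfs(root):
    """Map each target X.pdf (placed next to directory X) to X's sorted images."""
    plan = {}
    directories = sorted(p for p in Path(root).rglob("*") if p.is_dir())
    for directory in directories:
        images = sorted(
            f for f in directory.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        if images:
            # X.pdf goes one level up, i.e. into the parent of directory X.
            plan[directory.parent / (directory.name + ".pdf")] = images
    return plan


def write_pdfs(plan):
    """Write every planned pdf; img2pdf embeds the JPEG data without recompression."""
    import img2pdf  # third-party, imported here so that planning works without it

    for target, images in plan.items():
        with open(target, "wb") as output:
            output.write(img2pdf.convert([str(image) for image in images]))
```

Calling `plan_pdfs` on its own is a way to preview which files would be produced before anything is written.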
scandirpdf2txt.py
Read the installation_instructions.txt and readme_scandirpdf2txt.txt for recommendations and use
It recursively parses sub directories for .pdf files
Provided that these were previously processed with OCR (Optical Character Recognition) software, it will extract their text into .txt files
count_pdf_pages.py
Read the installation_instructions.txt for the prerequisites
It recursively parses sub directories for .pdf files
It generates a logCountPages.txt file with one line per pdf document found. Each line contains the number of pages of the pdf document and the document path.
scandirpdf2cover.py
Read the installation_instructions.txt for the prerequisites
It recursively parses sub directories for .pdf files
For each .pdf, it generates a png file preview of the cover page. The png is placed in the same directory as the pdf.
Only this tool and scandirpdf2coverjpg.py make use of the wand python library
scandirpdf2coverjpg.py
Read the installation_instructions.txt for the prerequisites
It recursively parses sub directories for .pdf files
For each .pdf, it generates a jpg file preview of the cover page. The jpg is placed in the same directory as the pdf.
Only this tool and scandirpdf2cover.py make use of the wand python library
scandirpdf2jpg.py & scandirpdf2png.py
This is the reverse operation of scandir2pdf. The images created will be named like page_0001.jpg …
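A name of that shape can be produced with a zero-padded format, as sketched below. The helper `page_filename` is hypothetical. Numbering from 1, four digits and the same pattern for the png variant are assumptions drawn from the single example name above.

```python
def page_filename(index, extension="jpg", digits=4):
    """Build a zero-padded page name such as page_0001.jpg."""
    return f"page_{index:0{digits}d}.{extension}"


names = [page_filename(number) for number in (1, 2, 10)]
print(names)

# Zero padding keeps alphabetical order identical to page order.
print(sorted(names) == names)
```

Without the padding, page_10 would sort before page_2 in most file managers.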
scandirpdf2noocr.py
Recursively removes all the OCR text from the pdfs. This can be needed if your OCR software happens to append its generated text to the text already present. This tool should only be used on pdf files that result from scans and were processed through OCR. Running it on a pdf that is a printout of a .doc file will remove the text entirely!
scandirjpg2pdf.py
It is almost like scandir2pdf, except that it creates one pdf per image. It only behaves like scandir2pdf on a directory if a file named multi.txt is present there.
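The effect of the multi.txt marker can be sketched as a planning function. This is an illustration only: `plan_directory` is a hypothetical name, and it is an assumption that per-image pdfs are written next to their image and that both .jpg and .jpeg files are handled.

```python
from pathlib import Path


def plan_directory(directory):
    """Map each pdf to create for one directory to the images it would contain."""
    directory = Path(directory)
    images = sorted(
        f for f in directory.iterdir()
        if f.is_file() and f.suffix.lower() in {".jpg", ".jpeg"}
    )
    if not images:
        return {}
    if (directory / "multi.txt").is_file():
        # Marker present: one pdf for the whole directory, one level up.
        return {directory.parent / (directory.name + ".pdf"): images}
    # Default: one pdf per image (assumed to sit beside the image).
    return {image.with_suffix(".pdf"): [image] for image in images}
```

Dropping an empty multi.txt into a directory is then enough to switch that directory to the grouped behaviour.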