Posted 2020-10-25 | Updated 2024-01-11 | Python | 3 minutes read (About 425 words)

PDF to txt | txt to PDF

How to extract the text content of a PDF with Python

pdf2txt.py: extract the text from PDF files.

PDF to text

1. Install

git clone https://github.com/pdfminer/pdfminer.six.git
cd pdfminer.six
python3 setup.py install

Or you can use pip:

sudo pip3.7 install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfminer.six

2. Run

pdf2txt.py Papers/vilhelmsson2004.pdf | tail -n 20

It returns:

ken@ken-PC:~/Desktop$ pdf2txt.py Papers/vilhelmsson2004.pdf| tail -n 20
Fotsis T & Mann M (1996) Femtomole sequencing of proteins
from polyacrylamide gels by nano-electrospray mass spec-
trometry. Nature 379, 466– 469.

Wilson RP & Cowey CB (1985) Amino acid composition of
whole-body tissue of rainbow trout and Atlantic salmon. Aqua-
culture 48, 373– 376.

Wing SS, Haas AL & Goldberg AL (1995) Increase in ubiquitin –
protein conjugates concomitant with the increase in proteolysis
in rat skeletal muscle during starvation and atrophy denerva-
tion. Biochem J 307, 639–645.

Yamamoto T, Shima T, Furuita H & Suzuki N (2002) Inﬂuence of
feeding diets with and without ﬁsh meal by hand and by self-
feeders on feed intake, growth and nutrient utilization of juven-
ile rainbow trout (Oncorhynchus mykiss). Aquaculture 214,
289– 305.
...

Compare with the raw file:

[Screenshot of the original PDF, © vilhelmsson 2004]
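
The same extraction can also be done from inside Python rather than through the command line. The sketch below assumes pdfminer.six is installed and uses its high-level `extract_text` function; the helper names `tail_lines` and `pdf_tail` are made up for this example and only mimic the `| tail -n 20` step shown above.

```python
def tail_lines(text, n=20):
    """Return the last n lines of text, similar to `tail -n`."""
    lines = text.splitlines()
    return "\n".join(lines[-n:]) if n > 0 else ""


def pdf_tail(pdf_path, n=20):
    """Extract all text from a PDF and return its last n lines."""
    # Imported lazily so tail_lines() works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return tail_lines(extract_text(pdf_path), n)


# Example call (the path is the one used in this post):
# print(pdf_tail("Papers/vilhelmsson2004.pdf", n=20))
```

Calling `extract_text` directly is handy when the text needs further processing in the same script, for example rejoining words that were hyphenated across line breaks.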

PDF to html

Output as an HTML file:

pdf2txt.py -o test.html Papers/vilhelmsson2004.pdf
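
A rough Python equivalent of the command above, as a sketch only. It assumes pdfminer.six is installed and uses `extract_text_to_fp` with `output_type="html"`. The helper `guess_output_type` and its extension table are hypothetical; they only imitate how the output format seems to be chosen from the `-o` file name.

```python
from io import StringIO

# Assumed mapping from file extension to output format (illustrative only).
OUTPUT_TYPES = {".htm": "html", ".html": "html"}


def guess_output_type(out_path):
    """Pick an output type from the file extension, defaulting to plain text."""
    lowered = out_path.lower()
    for ext, output_type in OUTPUT_TYPES.items():
        if lowered.endswith(ext):
            return output_type
    return "text"


def convert_pdf(pdf_path, out_path):
    """Convert a PDF to text or HTML depending on out_path's extension."""
    # Imported lazily so guess_output_type() works without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(pdf_path, "rb") as pdf_file:
        # codec=None makes pdfminer write str (not bytes) into the buffer.
        extract_text_to_fp(pdf_file, buffer, laparams=LAParams(),
                           output_type=guess_output_type(out_path), codec=None)
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(buffer.getvalue())


# Example call (paths taken from this post):
# convert_pdf("Papers/vilhelmsson2004.pdf", "test.html")
```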

Text to PDF

Profile: mkumarchaudhary06

Install

sudo pip3.7 install -i https://pypi.tuna.tsinghua.edu.cn/simple fpdf

Quick Start

from fpdf import FPDF

pdf = FPDF()    # create an FPDF document object
pdf.add_page()  # add a page

pdf.set_font("Arial", size=15)  # set the font family and size used in the PDF

pdf.cell(200, 10, txt="GeeksforGeeks",
         ln=1, align='C')  # centered cell; ln=1 moves to the start of the next line
pdf.cell(200, 10, txt="A Computer Science portal for geeks.",
         ln=2, align='C')  # another centered cell; ln=2 moves below it

pdf.output("GFG.pdf")  # save the PDF as GFG.pdf

Output:
GFG.pdf

It works, but you need to split long sentences before running this code, or the text will run off the page.
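
One possible way to do that splitting, as a minimal sketch: wrap the text with the standard-library `textwrap` module and write one cell per wrapped line. The function names and the `max_chars=60` limit are assumptions for this example. Arial is a proportional font, so a character count only approximates the page width and may need tuning.

```python
import textwrap


def split_long_lines(text, max_chars=60):
    """Split each input line into pieces of at most max_chars characters."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty string, so keep blank lines.
        lines.extend(textwrap.wrap(paragraph, width=max_chars) or [""])
    return lines


def text_to_pdf(text, out_path, max_chars=60):
    """Write text to a PDF, one cell per wrapped line."""
    # Imported lazily so split_long_lines() works without fpdf installed.
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=15)
    for line in split_long_lines(text, max_chars):
        # Width 0 extends the cell to the right margin; ln=1 starts a new line.
        pdf.cell(0, 10, line, ln=1)
    pdf.output(out_path)


# Example call (file name is illustrative):
# text_to_pdf(open("document.txt").read(), "document.pdf")
```

FPDF also provides `multi_cell`, which breaks lines by measured width instead of character count and may be the simpler choice when exact wrapping matters.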

To solve this problem, there is a script on GitHub by baruchel: txt2pdf. It's not perfect, but it works.

git clone https://github.com/baruchel/txt2pdf.git
txt2pdf -s 12 -o document.pdf document.txt

https://karobben.github.io/2020/10/25/Blog/pdf2txt/

Author

Karobben

#Python Tools
