How to use Python to extract the table, text in the PDF, save it in Excel format, 1.PDF, is the data we are about to process, then Adobe PDF, it has its own function, the PDF output to Excel format, but this function is charged, and there is a problem, that is, if you say that your PDF file, a lot, manual output, this is very expensive, then today we use Python, to automatically implement, this function, Come to the code, here, need to use Python's third-party library, pdfplumber, it is specially used to extract PDF content, before using, we need to go to install it, hold down win+r+cmd, input, pipinstallpdfplumber, install, pandas, this package, give it imported in, next, we can go to open the current PDF file, PDF, there are two cases, one is a page, One is multi-page, at this time, we need to call its properties, make a judgment, if its number of pages is greater than a page, we will store all the content to it in the tables list, if its content is exactly a page, we directly store in the tables variable, this storage, need to convert it to dataframe, this format, to be able to call toExcel, at this time, we come to print, this, tables, Should be a list type of output, then we see, now there is a list check set, there is a big list outside, inside, is a bunch of small lists, when we convert to dataframe, we need to fetch data, this, from an element, the beginning to get data, this is our Excel data, to tables, 0, the 0th element, that is, the title in our Excel, then we set it Here Collumns, in fact, it is dataframe, A column of this format, the index, here edge, set a parameter, yes, the Index to it turned off, because dataframe, it defaults to zero, one, two, this row index, we don't need it, set it to false, you can, to run it, we look at, the file, there is no saving success, has been saved successfully,
next, we look at, how to batch extract multiple PDF Fu files, we think about it, this, how to achieve, above, We are a single PDF, then when we modify, is it only necessary to write its path to it, the stored path, write it alive on it, well, to take this idea, let's see, how to implement, need to import Python's built-in library, an os, a globe, first give him a path, this path, we store on the desktop, PDF, such a folder, there are many PDFs, at this time, we need to go through, This PDF, then call this globe, its role is to put all the PDF suffix name files, its path, print it out, that, the current f, is under this folder, all the PDF paths, that, at this time, we are writing a function, the path is passed in, this function, it can be, we are the above code encapsulation, implementation, then let's look at it, here, we passed in a path, is this f, this path, next, This piece of code, and our above code, is actually the same, we only modify two places, one is open, this path, next, there is a file name of the excel we want to store, this, I extracted it, this name, is the name of the current PDF of mine, just modified a suffix name for it, and stored it in the data folder of the desktop. Here the curly brace format string is formatted, and the filename is passed into the curly brace. Run the display. Save successfully. Well, take a look. The folder for data. At this time, we will convert all the PDFs into batches.