Advanced topics

Paperless offers a couple of features that automate certain tasks and make your life easier.

Guesswork

Any document you put into the consumption directory will be consumed, but if you name the file right, Paperless will automatically set some values in the database for you. This is the logic the consumer follows:

  1. Try to find the correspondent, title, and tags in the file name following the pattern: Date - Correspondent - Title - tag,tag,tag.pdf. Note that the format of the date is rigidly defined as YYYYMMDDHHMMSSZ or YYYYMMDDZ. The Z refers to “Zulu time”, AKA “UTC”. The tags are optional, so the format Date - Correspondent - Title.pdf works as well.

  2. If that doesn’t work, we skip the date and try this pattern: Correspondent - Title - tag,tag,tag.pdf.

  3. If that doesn’t work, we try to find the correspondent and title in the file name following the pattern: Correspondent - Title.pdf.

  4. If that doesn’t work, just assume that the name of the file is the title.

So given the above, the following examples would work as you’d expect:

  • 20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf

  • 20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf

  • Some Company Name - Invoice 2016-01-01 - money,invoices.pdf

  • Another Company - Letter of Reference.jpg

  • Dad's Recipe for Pancakes.png

These however wouldn’t work:

  • 2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf

  • 2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf

  • Some Company Name, Invoice 2016-01-01, money, invoices.pdf

  • Another Company- Letter of Reference.jpg
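The four steps above can be sketched in Python as follows. This is only an illustration: the helper name guess_attributes and the approach of splitting on " - " are assumptions made for this sketch, and the actual consumer may use different expressions and treat edge cases differently.

```python
import re
from pathlib import PurePath

# Rigid date format from step 1: YYYYMMDDHHMMSSZ or YYYYMMDDZ.
DATE_RE = re.compile(r"^(\d{14}|\d{8})Z$")


def guess_attributes(filename):
    """Hypothetical helper mirroring the four guessing steps."""
    stem = PurePath(filename).stem  # file name without its extension
    parts = stem.split(" - ")       # fields are separated by " - "
    guess = {"date": None, "correspondent": None, "title": stem, "tags": []}

    if len(parts) in (3, 4) and DATE_RE.match(parts[0]):
        # Step 1: Date - Correspondent - Title [- tag,tag,tag]
        guess.update(date=parts[0], correspondent=parts[1], title=parts[2])
        if len(parts) == 4:
            guess["tags"] = parts[3].split(",")
    elif len(parts) == 3:
        # Step 2: Correspondent - Title - tag,tag,tag
        guess.update(correspondent=parts[0], title=parts[1],
                     tags=parts[2].split(","))
    elif len(parts) == 2:
        # Step 3: Correspondent - Title
        guess.update(correspondent=parts[0], title=parts[1])
    # Step 4: otherwise the whole name (minus extension) stays the title.
    return guess
```

With this sketch, "Another Company- Letter of Reference.jpg" falls through to step 4 because the separator is missing its leading space, so the whole name becomes the title and no correspondent is set.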

Do I have to be so strict about naming?

Rather than using the strict document naming rules, one can also set the option PAPERLESS_FILENAME_DATE_ORDER in paperless.conf to any date order that is accepted by dateparser. Doing so will cause paperless to default to any date format that is found in the title, instead of a date pulled from the document’s text, without requiring the strict formatting of the document filename as described above.

Transforming filenames for parsing

Some devices can’t produce filenames that can be parsed by the default parser. By configuring the option PAPERLESS_FILENAME_PARSE_TRANSFORMS in paperless.conf one can add transformations that are applied to the filename before it’s parsed.

The option contains a list of dictionaries of regular expressions (key: pattern) and replacements (key: repl) in JSON format, which are applied in order by passing them to re.subn. Transformation stops after the first match, so at most one transformation is applied. The general syntax is

[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]

The example below is for a Brother ADS-2400N, a scanner that allows assigning different names to different hardware buttons (useful for handling multiple entities in one instance), but insists on adding _<count> to the filename.

# Brother profile configuration, support "Name_Date_Count" (the default
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
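To see what these transformations produce, the following Python sketch applies the same JSON list in order with re.subn and stops after the first pattern that matches. The function name transform_filename and the sample file names are made up for illustration, and the sketch assumes no extra regex flags are set; it is not the consumer's actual code.

```python
import json
import re

# Same value as in the Brother example above, kept verbatim in a raw string.
TRANSFORMS_JSON = r'''[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'''


def transform_filename(filename, transforms):
    """Apply the first transformation whose pattern matches, if any."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            return new_name  # stop after the first match
    return filename  # no pattern matched: leave the name untouched


transforms = json.loads(TRANSFORMS_JSON)
print(transform_filename("invoices_20200101_120000_001.pdf", transforms))
print(transform_filename("receipts_007.pdf", transforms))
```

The first call prints "20200101120000Z - 001 - invoices.pdf" and the second prints " - 007 - receipts.pdf". A name that matches neither pattern is returned unchanged.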

Matching tags, correspondents and document types

After the consumer has tried to figure out what it could from the file name, it starts looking at the content of the document itself. It will compare the matching algorithms defined by every tag and correspondent already set in your database to see if they apply to the text in that document. In other words, if you defined a tag called Home Utility that had a match property of bc hydro and a matching_algorithm of literal, Paperless will automatically tag your newly-consumed document with your Home Utility tag so long as the text bc hydro appears in the body of the document somewhere.

The matching logic is quite powerful and supports searching the text of your document with different algorithms. As such, some experimentation may be necessary to get things right.

In order to have a tag, correspondent or type assigned automatically to newly consumed documents, assign a match and matching algorithm using the web interface. These settings define when to assign correspondents, tags and types to documents.

The following algorithms are available:

  • Any: Looks for any occurrence of any word provided in match in the PDF. If you define the match as Bank1 Bank2, it will match documents containing either of these terms.

  • All: Requires that every word provided appears in the PDF, albeit not in the order provided.

  • Literal: Matches only if the match appears exactly as provided in the PDF.

  • Regular expression: Parses the match as a regular expression and tries to find a match within the document.

  • Fuzzy match: Not documented here. Look at the source for details.

  • Auto: Tries to automatically match new documents. This does not require you to set a match. See the notes below.

When using the “any” or “all” matching algorithms, you can search for terms that consist of multiple words by enclosing them in double quotes. For example, defining a match text of "Bank of America" BofA using the “any” algorithm will match documents that contain either “Bank of America” or “BofA”, but will not match documents containing “Bank of South America”.

Then just save your tag/correspondent and run another document through the consumer. Once complete, you should see the newly-created document, automatically tagged with the appropriate data.
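As a rough illustration of how the “any”, “all”, “literal” and “regular expression” algorithms differ, here is a small Python sketch. The function names are hypothetical, and the case-insensitive, whole-word comparison is an assumption of this sketch rather than a statement about how Paperless implements matching. Fuzzy and Auto matching are left out.

```python
import re


def split_terms(match):
    """Split a match string into terms, keeping "quoted phrases" together."""
    return [quoted or bare
            for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', match)]


def contains(term, text):
    # Whole-word, case-insensitive search (both are assumptions of this sketch).
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def matches(algorithm, match, text):
    """Return True if the match applies to text under the given algorithm."""
    if algorithm == "any":
        return any(contains(term, text) for term in split_terms(match))
    if algorithm == "all":
        return all(contains(term, text) for term in split_terms(match))
    if algorithm == "literal":
        return contains(match, text)
    if algorithm == "regex":
        return re.search(match, text) is not None
    raise ValueError(f"unsupported algorithm: {algorithm}")
```

For example, matches("any", '"Bank of America" BofA', "Bank of South America") is False in this sketch, because neither the quoted phrase nor “BofA” occurs in the text.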

Automatic matching

Paperless-ng comes with a new matching algorithm called Auto. This matching algorithm tries to assign tags, correspondents and document types to your documents based on how you have assigned these on existing documents. It uses a neural network under the hood.

If, for example, all your bank statements of your account 123 at the Bank of America are tagged with the tag “bofa_123” and the matching algorithm of this tag is set to Auto, this neural network will examine your documents and automatically learn when to assign this tag.

Paperless tries to hide much of the involved complexity with this approach. However, there are a couple of caveats you need to keep in mind when using this feature:

  • Changes to your documents are not immediately reflected by the matching algorithm. The neural network needs to be trained on your documents after changes. Paperless periodically (default: once each hour) checks for changes and does this automatically for you.

  • The Auto matching algorithm only takes documents into account which are NOT placed in your inbox (i.e., documents that have no inbox tags assigned to them). This ensures that the neural network only learns from documents which you have correctly tagged before.

  • The matching algorithm can only work if there is a correlation between the tag, correspondent or document type and the document itself. Your bank statements usually contain your bank account number and the name of the bank, so this works reasonably well. However, tags such as “TODO” cannot be automatically assigned.

  • The matching algorithm needs a reasonable number of documents to identify when to assign tags, correspondents, and types. If one out of a thousand documents has the correspondent “Very obscure web shop I bought something five years ago”, it will probably not assign this correspondent automatically if you buy something from them again. The more documents, the better.

  • Paperless also needs a reasonable number of negative examples to decide when not to assign a certain tag, correspondent or type. This will usually be the case as you start filling up paperless with documents. Example: If all your documents are from either “Webshop” or “Bank”, paperless will assign one of these correspondents to ANY new document, if both are set to automatic matching.

Hooking into the consumption process

Sometimes you may want to do something arbitrary whenever a document is consumed. Rather than try to predict what you may want to do, Paperless lets you execute scripts of your own choosing just before or after a document is consumed using a couple of simple hooks.

Just write a script, put it somewhere that Paperless can read & execute, and then put the path to that script in paperless.conf with the variable name of either PAPERLESS_PRE_CONSUME_SCRIPT or PAPERLESS_POST_CONSUME_SCRIPT.

Important

These scripts are executed in a blocking process, which means that if a script takes a long time to run, it can significantly slow down your document consumption flow. If you want things to run asynchronously, you’ll have to fork the process in your script and exit.
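One way to keep a hook from blocking, sketched here in Python, is to start the slow work as a detached child process and let the hook script end right away. This uses subprocess.Popen rather than a literal fork, and the command /usr/local/bin/slow-job mentioned in the comment is a hypothetical placeholder.

```python
import subprocess
import sys


def launch_detached(command):
    """Start command in its own session and return without waiting for it."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from the caller's session
    )


# In a hook script one might call, for example:
#     launch_detached(["/usr/local/bin/slow-job", sys.argv[1]])
# and then simply let the script end so that consumption can continue.
```

Because the hook no longer waits for the child, it also cannot react to the child's exit status. Any error handling has to live in the background job itself.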

Pre-consumption script

Executed after the consumer sees a new document in the consumption folder, but before any processing of the document is performed. This script receives exactly one argument:

  • Document file name

A simple but common example would be a script like this:

/usr/local/bin/ocr-pdf

#!/usr/bin/env bash
# Quote the path so that file names containing spaces stay one argument.
pdf2pdfocr.py -i "${1}"

/etc/paperless.conf

...
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
...

This will pass the path to the document about to be consumed to /usr/local/bin/ocr-pdf, which will in turn call pdf2pdfocr.py on your document. That program overwrites the file with an OCR’d version and exits, at which point the consumption process begins with the newly modified file.

Post-consumption script

Executed after the consumer has successfully processed a document and has moved it into paperless. It receives the following arguments:

  • Document id

  • Generated file name

  • Source path

  • Thumbnail path

  • Download URL

  • Thumbnail URL

  • Correspondent

  • Tags

The script can be in any language you like, but for a simple shell script example, you can take a look at post-consumption-example.sh in the scripts directory in this project.

The post-consumption script cannot cancel the consumption process.
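As a minimal sketch of a post-consumption script in Python, the following maps the positional arguments to names in the order listed above. The key names are chosen for this sketch only, and the tags value is kept as the raw string it arrives as, since its exact format is not described here.

```python
#!/usr/bin/env python3
import sys

# Argument order as listed above; the key names are chosen for this sketch.
FIELDS = (
    "document_id", "file_name", "source_path", "thumbnail_path",
    "download_url", "thumbnail_url", "correspondent", "tags",
)


def parse_arguments(argv):
    """Map positional arguments (argv[0] is the script path) to names."""
    return dict(zip(FIELDS, argv[1:]))


def main(argv):
    info = parse_arguments(argv)
    # Replace this line with whatever should happen after consumption.
    print(f"Consumed document {info.get('document_id')}: {info.get('file_name')}")


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the argument handling in one small function makes it easy to extend the script later without having to remember which position holds which value.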

File name handling

By default, paperless stores your documents in the media directory and renames them using the identifier which it has assigned to each document. You will end up getting files like 0000123.pdf in your media directory. This isn’t necessarily a bad thing, because you normally don’t have to access these files manually. However, if you wish to name your files differently, you can do that by adjusting the PAPERLESS_FILENAME_FORMAT settings variable.

This variable allows you to configure the filename (folders are allowed!) using placeholders. For example, setting

PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}

will create a directory structure as follows:

2019/
  my_bank/
    statement-january-0000001.pdf
    statement-february-0000002.pdf
2020/
  my_bank/
    statement-january-0000003.pdf
  shoe_store/
    my_new_shoes-0000004.pdf

Paperless appends the unique identifier of each document to the filename. This avoids filename clashes.

Danger

Do not manually move or rename your files in the media folder. Paperless remembers the last filename a document was stored as. If you do rename or move a file, paperless will report it as missing and won’t be able to find it.

Paperless provides the following placeholders within filenames:

  • {correspondent}: The name of the correspondent, or “none”.

  • {title}: The title of the document.

  • {created}: The full date and time the document was created.

  • {created_year}: Year created only.

  • {created_month}: Month created only (number 1-12).

  • {created_day}: Day created only (number 1-31).

  • {added}: The full date and time the document was added to paperless.

  • {added_year}: Year added only.

  • {added_month}: Month added only (number 1-12).

  • {added_day}: Day added only (number 1-31).

  • {tags}: Not documented here. Look at the source for details.

Paperless will convert all values for the placeholders into values which are safe for use in filenames.
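The following Python sketch shows how such a format string could be expanded. It is an approximation under stated assumptions: the helper names safe and generate_filename, the sanitising rule, and the rendering of {created} and {added} as ISO timestamps are all invented for illustration, and {tags} is left out. The seven-digit identifier suffix and the fallback to the default naming scheme follow the description in this section.

```python
import re
from datetime import datetime


def safe(value):
    """Hypothetical sanitiser: lower-case and replace unsafe runs with "_"."""
    return re.sub(r"[^\w\-]+", "_", str(value).strip().lower()).strip("_")


def generate_filename(fmt, doc_id, title, correspondent, created, added, ext="pdf"):
    """Expand the placeholders in fmt and append the document identifier."""
    values = {
        "correspondent": safe(correspondent) if correspondent else "none",
        "title": safe(title),
        "created": safe(created.isoformat()),
        "created_year": str(created.year),
        "created_month": str(created.month),
        "created_day": str(created.day),
        "added": safe(added.isoformat()),
        "added_year": str(added.year),
        "added_month": str(added.month),
        "added_day": str(added.day),
    }
    try:
        path = fmt.format(**values)
    except (KeyError, IndexError, ValueError):
        # Misspelt or malformed placeholder: use the default naming scheme.
        return f"{doc_id:07d}.{ext}"
    return f"{path}-{doc_id:07d}.{ext}"
```

Called with the format {created_year}/{correspondent}/{title}, identifier 1, title "statement-january", correspondent "My Bank" and a creation date in 2019, this sketch returns 2019/my_bank/statement-january-0000001.pdf.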

Hint

Paperless checks the filename of a document whenever it is saved. Therefore, after altering this setting, you need to invoke the document renamer to update the filenames of your existing documents and move them.

Warning

Make absolutely sure you get the spelling of the placeholders right, or else paperless will use the default naming scheme instead.

Caution

As of now, you could totally tell paperless to store your files anywhere outside the media directory by setting

PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}

However, keep in mind that inside docker, if files get stored outside of the predefined volumes, they will be lost after a restart of paperless.