Document Classifier

Overview

VisualVault’s Document Classifier feature allows VisualVault implementors to create and train their own machine learning models, which will be used to automatically assign the Document Type index field.

Objective

The Document Classifier leverages an AI learning model to automatically assign the Document Type index field during a bulk upload or bulk scanning process. This process helps prevent human errors associated with incorrect designation, as the AI is trained on large volumes of documents specific to the customer.

When uploading a small batch of documents, a user can assign index fields manually for the batch. For large-scale bulk uploads containing a mixed set of document types, manual indexing becomes time-intensive and costly. The Document Classifier addresses this by automatically sorting the documents and assigning each one a Document Type.

Description

Key Functionality

AI-Powered Classification: Uses an AI learning model to automatically assign the Document Type index field.
Bulk Upload Support: Operates during bulk upload or bulk scanning processes.
Error Reduction: Minimizes human errors in document type designation.
Customer-Specific Training: AI is trained on a large volume of customer-specific documents for improved accuracy.
Efficiency in Large Uploads: Eliminates the need for time-consuming and costly manual indexing during large-scale uploads.
Automatic Sorting: Automatically sorts and categorizes mixed document types within a bulk upload.

Quickstart Guide

New to the Document Classifier? Use this guide to learn the basics.

Walkthrough

Prerequisites

Determine which file types need classification
- All files in the customer database of the specified types will be classified.
- PDF and PNG are the supported file types.
Training Data
- You must have real-world sample files that have been manually labeled (Document Type value set).
- Training data must be uploaded or imported into the customer database’s VisualVault doc library with the Document Type index field specified (labeled).
- A minimum of 100 sample files per Document Type is recommended. This is only a guideline; more is better.
- Upload documents to a folder in the Document Library. The content of these documents will be used to train the classification machine learning model. Supported file types include PDF, DOCX, XLSX, PPT, TXT, and TIFF formats.
Index Fields
- Best practice is to use the system-generated “Document Type” field when configuring classification.
- Create a numeric Index Field named “Confidence” to store the auto-classification confidence level for each Document that is classified.
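Before training, it can help to check the 100-samples-per-type guideline against the labels you have. The following is a minimal sketch, not part of VisualVault: it assumes you already have a plain list of Document Type labels (one per training file), for example from a manual export. The function name and the sample data are hypothetical.

```python
from collections import Counter

# Guideline from this guide; it is a recommendation, not a hard limit.
RECOMMENDED_MIN_SAMPLES = 100


def find_underrepresented_types(labels, minimum=RECOMMENDED_MIN_SAMPLES):
    """Return {document_type: count} for types with fewer than `minimum` samples."""
    counts = Counter(labels)
    return {doc_type: n for doc_type, n in counts.items() if n < minimum}


# Hypothetical label list: one Document Type value per training file.
labels = ["Invoice"] * 120 + ["Contract"] * 45 + ["Receipt"] * 100
print(find_underrepresented_types(labels))  # {'Contract': 45}
```

Any Document Type reported here is a candidate for collecting more labeled samples before you train.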

Create a New Classifier Model

To create a new Classifier Model:

From the Enterprise Tools tab of the Control Panel, navigate to Process Design Studio.
From the left menu, select Document Classifier (under Analytics). Click New Classifier Model.
The New Document Classifier Model window will open.

Create/Edit Document Classifier Model

To edit a Model, go to the Document Classifier screen and click the model's Edit button.

Fill in the Model Name
Fill in the Model description
Select the Document Library folder(s) where the training data is located.
1. See Prerequisites.
Select the Index Field “Document Type”
Select the Input Type
1. Image and Text: When the layout is consistent for every doc type, then select the “Image and Text” option.
2. Text: When there may be varying layouts for the same doc type, then select the “Text” input. When unsure, “Text” is the better default choice.
Click Save. The new Classifier Model should now appear in the list, with its status set to Initializing.
Click Train

Evaluate Classification Quality

Review the model’s calculated Threshold value.
1. Threshold value = Minimum Document Confidence Level
  1. This is a value between 0 and 1 representing a percentage. A document's confidence level should be greater than or equal to the threshold.
2. Upload a sample of customer documents that were not part of the training data and confirm the confidence level is >= Threshold value.
3. If Document confidence is low, a new model with additional training data or more accurate Doc Type labeling is required.
Review the Confusion Matrix Chart
1. Ideally, the Document Types line up along the diagonal of the chart.
2. Look for the “True Label” to match the “Predicted Label”.
3. If multiple Doc Types are mismatched, a new model with additional training data or more accurate Doc Type labeling is required.
Review the Scatter Plot Chart
1. Tight cluster of points = high accuracy
2. If you cannot “draw a circle” around the points of one Document Type without including points of another Doc Type, a new model with additional training data or more accurate Doc Type labeling is required.
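To illustrate what the Confusion Matrix Chart summarizes, here is a minimal sketch that counts (True Label, Predicted Label) pairs for a set of held-out documents and lists the mismatches. It is an illustration only and does not reflect how VisualVault computes its chart; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; diagonal entries have true == predicted."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def mismatched_pairs(counts):
    """Return off-diagonal (true, predicted) pairs with counts, largest first."""
    off_diagonal = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(off_diagonal.items(), key=lambda item: item[1], reverse=True)


# Hypothetical held-out sample: manual labels vs. what the model assigned.
true_labels = ["Invoice", "Invoice", "Contract", "Contract", "Receipt", "Receipt"]
predicted = ["Invoice", "Receipt", "Contract", "Contract", "Receipt", "Invoice"]

counts = confusion_counts(true_labels, predicted)
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(true_labels)

print(mismatched_pairs(counts))
print(round(accuracy, 2))
```

In this made-up sample, Invoice and Receipt are confused with each other while Contract is always predicted correctly, which is the kind of pattern that suggests adding or relabeling training data for the two confused types.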

Delete Document Classifier Model

To delete a Model, go to the Document Classifier screen.

Select an item from the list.
Click the Delete icon in that item's row, or select multiple items in the list and then click the Delete Model(s) button.

User Groups and Permissions

Action	VaultAccess	VaultAdmin	Configuration Admin	Owner	Editor	Viewer
Set up prerequisite training data in the Document Library	✅	✅	✅	✅	✅	🚫
Create a New Classifier Model	✅	✅	✅	✅	🚫	🚫
Edit Document Classifier Model	✅	✅	✅	✅	🚫	🚫
Configure a Document Pipeline	🚫	🚫	✅	🚫	🚫	🚫
Evaluate Classification Quality	✅	✅	✅	✅	🚫	🚫
Document Classification Monitoring	✅	✅	✅	✅	🚫	🚫
Delete Document Classifier Model	✅	✅	✅	✅	🚫	🚫

FAQ

What types of files does the Document Classifier support?

Supported file types for classification include:

PDF, DOCX, XLSX, PPT, TXT, TIFF, and PNG (for classification jobs)
Note: PNG and PDF are the only file types supported in the Document Pipeline > Classification Job configuration.
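As a pre-upload check, a batch can be screened by file extension. The sketch below is an assumption-laden illustration, not a VisualVault feature: the two sets are copied from this guide, matching is done on the file extension only, and the function name and file names are hypothetical.

```python
from pathlib import Path

# Extension sets copied from this guide; treat them as a snapshot.
TRAINING_TYPES = {".pdf", ".docx", ".xlsx", ".ppt", ".txt", ".tiff"}
CLASSIFICATION_JOB_TYPES = {".pdf", ".png"}


def split_by_support(filenames, supported):
    """Partition filenames into (supported, unsupported) by lowercase extension."""
    ok, rejected = [], []
    for name in filenames:
        target = ok if Path(name).suffix.lower() in supported else rejected
        target.append(name)
    return ok, rejected


# Hypothetical batch destined for a Classification Job.
batch = ["scan_001.PDF", "photo.png", "notes.txt", "sheet.xlsx"]
ok, rejected = split_by_support(batch, CLASSIFICATION_JOB_TYPES)
print(ok)        # ['scan_001.PDF', 'photo.png']
print(rejected)  # ['notes.txt', 'sheet.xlsx']
```

Passing TRAINING_TYPES instead screens the same batch against the training data formats.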

What is the minimum amount of training data required to create a model?

While there's no strict minimum, it's recommended to use at least 100 labeled documents per Document Type. The more data you provide, the more accurate your model will be.

Do the training documents need to be manually indexed?

Yes. The training documents must have the "Document Type" index field pre-set before they are used to train the AI model. These labeled examples are crucial for the classifier to learn accurately.

Where do I upload my training documents?

Upload training files into a designated folder in your VisualVault Document Library. This folder is then referenced when setting up the model in the Document Classifier UI.

How do I improve the model’s classification accuracy?

Provide more labeled training documents
Ensure correct and consistent labeling of the "Document Type" field
Choose the correct Input Type (Image + Text vs. Text only)
Use evaluation tools like the Confusion Matrix and Scatter Plot to identify weaknesses

What is the Confidence Index Field and why is it important?

The Confidence Index Field stores the AI model’s confidence level (from 0 to 1) in its classification. It helps assess the reliability of a classification and can be used to filter or flag uncertain predictions.

What happens if the model classifies a document with low confidence?

If the confidence value is below the model's threshold, you should:

Add more or better-labeled training documents
Re-train the model
Retraining with more or better-labeled data helps the model distinguish between similar Document Types.

What does the Threshold Value represent?

The Threshold is the minimum confidence level required for a prediction to be considered valid. A higher threshold favors accuracy but may result in more unclassified documents; a lower threshold allows broader classification but risks more errors.
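The trade-off can be seen with a small sketch that applies two different thresholds to the same predictions. This is an illustration only: the (Document Type, confidence) pairs are made up, the function name is hypothetical, and how VisualVault handles documents below the threshold is not specified here.

```python
def apply_threshold(predictions, threshold):
    """Split (doc_type, confidence) pairs into accepted and flagged lists."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    accepted = [p for p in predictions if p[1] >= threshold]
    flagged = [p for p in predictions if p[1] < threshold]
    return accepted, flagged


# Hypothetical predictions: (Document Type, confidence between 0 and 1).
predictions = [
    ("Invoice", 0.97),
    ("Contract", 0.82),
    ("Receipt", 0.64),
    ("Invoice", 0.41),
]

for threshold in (0.5, 0.9):
    accepted, flagged = apply_threshold(predictions, threshold)
    print(threshold, len(accepted), len(flagged))
# 0.5 3 1
# 0.9 1 3
```

Raising the threshold from 0.5 to 0.9 keeps only the most confident prediction and leaves three documents for review, which matches the accuracy-versus-coverage trade-off described above.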

Can I use one model for multiple document libraries or customers?

No. Each model is customer-specific, trained on that customer’s own labeled data for best accuracy. You need to create and train separate models for each customer database.

How do I delete a Document Classifier model?

Navigate to the Document Classifier screen, select the model(s) from the list, and click the Delete icon or Delete Model(s) button.

What if the Document Pipeline option is missing in my Control Panel?

If Document Pipelines is not visible under Enterprise Tools, you must submit a support ticket to have the feature enabled for your customer database.

Should I use "Image and Text" or just "Text" as the input type?

Use Image and Text if document layouts are consistent across types.
Use Text if documents have varied layouts for the same type.
Default recommendation: Use Text when unsure.

Need help? Please contact your organization’s administrator for troubleshooting. If the issue remains unresolved, VisualVault Support is available to assist.