Feature/Idea: Document Management #259

Open
toliver38 opened this issue Jan 7, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@toliver38

toliver38 commented Jan 7, 2025

Users transitioning from tools such as RagFlow expect document management features alongside Knowledge Graph capabilities. While TrustGraph supports document loading through the Workbench and metadata integration, it lacks tools to manage documents once they have been ingested.

Discussed briefly here: Discord Discussion

Problem

TrustGraph does not provide:

  • A way to list or search ingested documents.
  • Tools to delete, move, or rename documents within collections.
  • Status visibility for ingested documents.

These gaps make it difficult to curate and organize knowledge bases effectively.

Proposed Features

1. CLI Tools for Document Management

  • tg-list-documents -c <collection>: Lists all documents in a specified collection.
  • tg-delete-document -c <collection> -f <document_id>: Deletes a document by its ID within a collection.
  • tg-rename-document -f <document_id> -n <new_name>: Renames a document.
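
The flag layout proposed above could be sketched with Python's standard `argparse` module. This is only an illustration of the argument shapes: the `build_parser` helper and the long option names (`--collection`, `--document-id`, `--new-name`) are assumptions made for this sketch, and none of these commands exist in TrustGraph today.

```python
import argparse


def build_parser(command: str) -> argparse.ArgumentParser:
    """Build a parser for one of the proposed (hypothetical) commands."""
    parser = argparse.ArgumentParser(prog=command)
    if command == "tg-list-documents":
        # -c <collection>: collection whose documents should be listed
        parser.add_argument("-c", "--collection", required=True)
    elif command == "tg-delete-document":
        # -c <collection> -f <document_id>: delete one document in a collection
        parser.add_argument("-c", "--collection", required=True)
        parser.add_argument("-f", "--document-id", required=True)
    elif command == "tg-rename-document":
        # -f <document_id> -n <new_name>: rename one document
        parser.add_argument("-f", "--document-id", required=True)
        parser.add_argument("-n", "--new-name", required=True)
    else:
        raise ValueError(f"unknown command: {command}")
    return parser


if __name__ == "__main__":
    args = build_parser("tg-delete-document").parse_args(
        ["-c", "default", "-f", "abc123"]
    )
    print(args.collection, args.document_id)
```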

2. Workbench GUI Enhancements

  • Add document listing with search and sorting capabilities.
  • Provide options to delete, move, rename, or tag documents in an intuitive interface.

3. API Endpoints to Support Document Management

List Documents
  • Endpoint: GET /api/v1/documents
  • Description: Retrieves a list of documents, optionally restricted to a single collection.
  • Query Parameters:
    • collection: The name of the collection to list documents from (optional).
  • Response:
    [
      {
        "document_id": "abc123",
        "name": "example.pdf",
        "metadata": { "key": "value" },
        "collection": "default",
        "user": "trustgraph"
      },
      {
        "document_id": "xyz456",
        "name": "example2.pdf",
        "metadata": { "key": "value" },
        "collection": "default",
        "user": "trustgraph"
      }
    ]
Delete Document
  • Endpoint: DELETE /api/v1/documents/{document_id}
  • Description: Deletes a document by its unique ID.
  • Path Parameters:
    • document_id: The ID of the document to delete.
  • Response:
    { "message": "Document deleted successfully." }
Rename Document
  • Endpoint: PUT /api/v1/documents/{document_id}/rename
  • Description: Renames a document by its unique ID.
  • Path Parameters:
    • document_id: The ID of the document to rename.
  • Body:
    { "new_name": "new_document_name.pdf" }
  • Response:
    { "message": "Document renamed successfully." }
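
To make the intended behaviour of the three endpoints concrete, here is a minimal in-memory sketch. `Document` and `DocumentStore` are hypothetical names invented for this example, not TrustGraph classes, and raising `KeyError` for an unknown ID is an assumption, since the proposal does not specify error responses. The return values mirror the response bodies listed above.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class Document:
    document_id: str
    name: str
    collection: str = "default"
    user: str = "trustgraph"
    metadata: dict = field(default_factory=dict)


class DocumentStore:
    """Hypothetical in-memory stand-in for the proposed document API."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def add(self, doc: Document) -> None:
        self._docs[doc.document_id] = doc

    def list_documents(self, collection: Optional[str] = None) -> List[dict]:
        # GET /api/v1/documents[?collection=...]
        return [
            asdict(d)
            for d in self._docs.values()
            if collection is None or d.collection == collection
        ]

    def delete_document(self, document_id: str) -> dict:
        # DELETE /api/v1/documents/{document_id}; KeyError if the ID is unknown
        del self._docs[document_id]
        return {"message": "Document deleted successfully."}

    def rename_document(self, document_id: str, new_name: str) -> dict:
        # PUT /api/v1/documents/{document_id}/rename; KeyError if the ID is unknown
        self._docs[document_id].name = new_name
        return {"message": "Document renamed successfully."}
```

A real implementation would also need to decide what `DELETE` does to the graph edges and embeddings derived from the document, which this sketch leaves out.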

Concern

I've found deleting nodes in graph-based tools can be resource-intensive, especially when nodes have complex relationships. This may impact TrustGraph, depending on the scale of the graph for each document and the efficiency of the deletion process.
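
One possible way to bound that cost is to delete in fixed-size batches rather than in a single operation. The sketch below assumes each triple records the ID of the document it came from; that provenance field, the `Triple` layout, and the `delete_in_batches` helper are all assumptions for illustration, not a description of how TrustGraph stores data.

```python
from itertools import islice
from typing import Iterator, List, Tuple

# (subject, predicate, object, source document_id); hypothetical layout
Triple = Tuple[str, str, str, str]


def delete_in_batches(
    triples: List[Triple], document_id: str, batch_size: int = 1000
) -> Iterator[int]:
    """Remove triples sourced from document_id, at most batch_size per step.

    Yields the number of triples removed in each step, so a caller can
    report progress or pause between batches.
    """
    while True:
        # Collect up to batch_size positions of triples from this document.
        indices = list(
            islice(
                (i for i, t in enumerate(triples) if t[3] == document_id),
                batch_size,
            )
        )
        if not indices:
            return
        # Delete from the end so earlier positions stay valid.
        for i in reversed(indices):
            del triples[i]
        yield len(indices)
```

In a real graph store each batch would correspond to one delete query or transaction, so other workloads can interleave between batches.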

@JackColquitt
Contributor

Have you looked into or been using knowledge cores?

https://trustgraph.ai/docs/cores/

A lot of the data management you're talking about is on the roadmap for our knowledge core approach. Also, "knowledge core" is very much a placeholder term, so open to suggestions. 😆

@toliver38
Author

I really like the Knowledge Core concept. It's really helpful for reuse. After playing with it a bit, I started to see the possibility of sharing subject-matter-expert knowledge cores between organizations.

My issue is that the other day I uploaded about 100 PDF files, and after some testing and evaluation I wanted to remove them from the backend because I had decided they were irrelevant to the collection I was working with.

On another occasion I couldn't easily list, by name, the documents that were already in the store, so I wasn't sure whether some had been processed and others had not. After digging into the logs, I found that the pipeline had halted on a malformed PDF. This is what motivated #243.

@JackColquitt JackColquitt added the enhancement New feature or request label Jan 7, 2025
@JackColquitt
Contributor

I don't know if you've seen the word "collection" at some points in the code base, but as @cybermaggedon alluded to here:

https://discord.com/channels/1251652173201149994/1251652174270959798/1326535071187992648

we've been planning this ability to manage data in the system for quite a while. This approach enables many features, from better data storage management all the way to a scheme for controlling data access.
