Replies: 4 comments 4 replies
-
I transferred this to a discussion as it's more appropriate here (I only just enabled discussions, not your fault). I don't think our needs diverge here; I too was looking to make a tool I can use to continuously pull from Keep into a local directory.

The "sync/mirror" problem is relevant, but only the pulling half of it, since this tool doesn't push notes at all. I believe an acceptable solution is possible without a local database (or other persistent state storage), though if push comes to shove, I'm not against using a local SQLite database. To answer some of your suggestions:
I've been thinking/experimenting with a fourth option: a truncated hash digest encoded in base36. Base36 uses 0-9 plus the 26 alphabet characters to encode, and is case-insensitive.
The numbers in parentheses (e.g. "base36:(10)") indicate how many characters that encoding produces. "blake-6" means a 6-byte digest output; "blake-7" means a 7-byte digest output. The 7-byte (56-bit) output seemingly always comes out to 11 characters (from my observations on an expanded set I was testing), while the 6-byte (48-bit) output varies in size a little. What are your thoughts on that?
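For concreteness, here's a minimal sketch of that fourth option: a BLAKE2b digest truncated to a configurable byte length, encoded in base36. The helper names (`base36`, `short_id`) are mine, not from the tool:

```python
import hashlib

def base36(n: int) -> str:
    """Encode a non-negative integer in base36 (0-9 + a-z, case-insensitive)."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(digits[r])
    return "".join(reversed(out))

def short_id(note_id: str, size: int = 7) -> str:
    """Truncated BLAKE2b digest of the Keep note ID, base36-encoded.

    size=7 (56 bits) almost always yields 11 characters; size=6 (48 bits)
    varies between 9 and 10 characters for most inputs.
    """
    digest = hashlib.blake2b(note_id.encode(), digest_size=size).digest()
    return base36(int.from_bytes(digest, "big"))

print(short_id("15e1ff865e7.b01d13e55b1750fc"))  # deterministic across runs
```

This matches the observation above: 36^10 ≈ 3.7e15 while 2^56 ≈ 7.2e16, so roughly 95% of 7-byte digests need 11 base36 characters, whereas 48-bit digests straddle the 36^9 boundary and vary more.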
-
### Proposal (tl;dr)

Make the ID format meaningful and configurable (based on the timestamp), but not unique; scan the frontmatter to index the notes and detect the true ID. The general flow for an export/clone:
### Truncated Hash

I'm comfortable with the unlikely chance of a (truncated) hash collision; I've gone through the same sort of exercise as you did above in other contexts, so that approach seems fine from that perspective. But it isn't the approach I prefer.

### ID Options

For the file format, I prefer something as meaningful or useful as possible, though, so:
### Usefulness of the ID

A deterministic ID is useful for:
### ID from Frontmatter

As far as matching a file to its corresponding note, we could also just rely on the frontmatter. We have the unique Google Keep ID in the frontmatter of each file. Scanning the files should be so quick that I don't see any reason not to do it every time an export process runs; reading all of the notes and building an index shouldn't take much time.

### Benchmark

Here's a rough benchmark of 149 notes. Frontmatter scanned using:

```python
import pathlib

import frontmatter  # python-frontmatter package

def index_note_files(directory):
    index = {}
    for file in pathlib.Path(directory).glob("*.md"):
        with open(file, "r") as f:
            fm = frontmatter.load(f)
            index[fm.metadata["id"]] = file
    return index
```
This is for 149 notes, so not a lot, and I can't imagine people will have more than a couple thousand notes in Google Keep, but I'm curious to be proven wrong! I took image download out of the main step since that's the least predictable portion.

If we needed to optimize (and I'd be surprised if we ever do), you could use the filename ID as a first pass, identify potential matches (i.e. two note files with the same ID), and look at the frontmatter for just those notes. I'd just go for scanning the whole lot, though.

### Local DB (e.g. SQLite)

I'm opposed to a SQLite DB for this until it's needed. If we did need one, I'd consider a more automatic/transparent caching layer, like karlicoss/cachew. (I love SQLite and use it for production workloads, when needed.) Design considerations:
### Uploading to Keep

You mentioned not wanting to support upload back to Keep, and I'm mostly in agreement with that. Someone's going to want it more strongly than I currently do, though.

### Side annoyance

Part of this whole thing is really a consumer tooling problem. E.g. Obsidian.md doesn't read the files for titles when linking/opening notes, so you want the filename to contain enough info to make that process easier and look better. I place most of the blame and responsibility on Obsidian and similarly limited tools, but that doesn't make the problem disappear.
-
I've implemented the majority of my proposal. It's based on the assumption that you'll start with a fresh download; it doesn't try to reconcile old-format files or media downloads, so remove everything local and re-download.
TODO:
ISSUES:
I ran
-
Thanks for the work! I've merged your changes in #21.
-
### Problem
The current approach of iterating over the notes and assigning an incrementing ID has the problem of generating different core IDs every time you run the export.
This manifests even over two subsequent runs without any changes on the Keep side: the order in which the API returns notes is effectively random.
So running twice you could end up with the following files:
### Potential Solutions
There is a unique ID for each note, but it's really verbose (e.g. `15e1ff865e7.b01d13e55b1750fc`). I did try that approach locally and didn't like the resulting filenames.

Alternate ideas for uniquely prefixing each note:
- Use the Unix timestamp (`note.timestamps.created.timestamp()`) directly? How much resolution does this actually have in the Keep API JSON?
- `YYYYMMDDHHMMSS` based on the create date of the note. Of course, it's possible for two notes (especially coming from different people) to be created in the same second, so this might require further logic to eliminate duplicates.
- Keep the incrementing ID, but sort on [`note.timestamps.created.timestamp()`, `note.id`] prior to iterating. Assumes notes will never be created in the past. This has the side benefit of making older notes have the lower numbers.

Using `note.id` is the only "safe" choice if you really want to uniquely identify a note, since you could gain new notes in the future if someone shares a note with you, and that note could be older than other notes you already have.

### Further Challenges / Considerations
Any approach above that simply fixes the prefix won't account for notes changing titles on the Keep side. To fully account for that, you would have to find the existing note file (if any) by ID, delete it, and write a new file with the same ID but the new title.
### Context
This may be where my needs and yours diverge too much. I'm looking for a script I can run continuously to pull the latest Google Keep notes locally and basically keep a local mirror of my Google Keep notes. Google Keep is great for quick shared notes, both in a desktop browser and on mobile, and I'm considering using it as part of my broader notes flow: something I can reference from my notes, but not necessarily edit/take over from the local computer side. So it's possible I need to fork and write `keep-mirror` instead.

For this project, I think just implementing one of the three options above is the best approach: a simple solution that provides more consistency, without falling into the "sync/mirror" rabbit hole.
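For what it's worth, the `YYYYMMDDHHMMSS` option with same-second collision handling could look like this (a sketch; the `-2`, `-3` suffixing scheme is my own assumption, not something the thread settled on):

```python
from datetime import datetime, timezone

def timestamp_prefix(created: datetime, taken: set) -> str:
    """YYYYMMDDHHMMSS prefix; append -2, -3, ... when two notes
    share the same creation second."""
    base = created.strftime("%Y%m%d%H%M%S")
    prefix, n = base, 1
    while prefix in taken:
        n += 1
        prefix = f"{base}-{n}"
    taken.add(prefix)
    return prefix

taken = set()
t = datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc)
print(timestamp_prefix(t, taken))  # 20240131120000
print(timestamp_prefix(t, taken))  # 20240131120000-2
```

Unlike `note.id`, this prefix is only stable as long as the set of same-second notes doesn't change, which is exactly the caveat raised above.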