Search needs improvements and an API #15
Comments
This merge Lyrics/lyrics-database#217 has fixed the issue with Oleander's song, please take a look. Those links to search.html that you listed above work now. Here's what's going on: the lyrics.github.io website is essentially a collection of static .html files, since GitHub only hosts static files (nothing is executed on the server). The search is implemented by querying GitHub's API with the lyrics repo as the target. Not including file names in search results by default is only one of its flaws: for example, there's also a limit on the number of requests that users who aren't logged in to github.com can make, something like 10 search requests per minute. GitHub's search engine is optimized for searching through large amounts of code, which works for finding lyrics, but in order to find something by a file/album/artist name, that information needs to be within the file (which metadata takes care of). Regarding Deadbeef and other players, my original idea was to download a whole copy of lyrics.git to be stored locally, and then search for lyrics not via a remote API but using the metadata within the locally stored database. Aside from privacy and speed, this approach is more reliable (e.g. any smartphone today could search through hundreds of megabytes of lyrics stored locally without needing an Internet connection). Valid point about the API, though: perhaps it could be added if this project gets its own server and a domain name, but it's not there yet. lyrics.git needs work, and needs to become popular enough to justify the transition from the free GitHub Pages hosting that it's currently using. |
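To make the mechanism concrete, here is a minimal sketch of how a client might build such a GitHub code search request. The `/search/code` endpoint and the `repo:` qualifier belong to GitHub's REST API; the function name and the default repository (`Lyrics/lyrics-database`, taken from the merge reference above) are assumptions for illustration, not the website's actual code.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/code"


def build_search_url(text, repo="Lyrics/lyrics-database"):
    """Return a code search URL restricted to a single repository.

    `repo` is an assumed default; pass whichever repository holds the lyrics.
    """
    # The `repo:` qualifier limits results to files inside that repository.
    query = f"{text} repo:{repo}"
    return f"{GITHUB_SEARCH_URL}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(build_search_url("runaway train"))
```

Since this kind of search only looks at file contents, a query for an artist or album name finds nothing unless that name is written inside the file, which is why the metadata matters.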
Nice that the Oleander song is fixed, but as you pointed out, practically any other song has the same issue right now. If you add the functionality, I'll gladly host it if my VPS can manage (I don't see why it wouldn't at this point, with the DB being a tiny 3.5K files).
That's an interesting approach. One issue I see with this is the number of files. At this point it's ~3.5K, but if this project gets bigger, having hundreds of thousands of small files could become an issue (just try copying that many files to a microSD card, for example: it's going to take a long while), and compiling it into a single big database might prove better. |
As a temporary solution, I could write a shell script to crawl the database, add metadata to every file that is necessary to make the search by album/artist/title possible. As another solution, the search logic on the website itself could be improved to parse Since the database is open and available in full, please feel free to create an API server and host it on your server, we could host the source code repo within the Lyrics organization and share the link to your server's API endpoint for others to use. Currently the database lacks enough metadata, but at least searching by artists/albums/titles using file names could be implemented this way. It could also be possible to leverage the GitHub's code search engine a little more, e.g. looking for Regarding using lyrics.git as a local database, I'd think by the time it becomes a popular source of lyrics, it's likely to contain hundreds of thousands, or even millions of compositions. At that point it'll be only possible to work with that amount of data latency-free if converted into a sqlite (or any other) database, which I believe many apps/programs already do when it comes to working with large amounts of data (since most filesystems have a high overhead and many other limitations when it comes to working with thousands of files or more). |
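As a rough illustration of searching by artist/album/title using file names alone, the sketch below loads paths into a single SQLite table and looks a song up. The `letter/artist/album/title` layout is an assumption inferred from the site's URLs, and the table and function names are hypothetical.

```python
import sqlite3


def rows_from_paths(paths):
    """Yield (artist, album, title) for paths shaped like letter/artist/album/title."""
    for path in paths:
        parts = [p for p in path.split("/") if p not in ("", ".")]
        if len(parts) == 4:  # skip directories and anything of another depth
            _letter, artist, album, title = parts
            yield artist, album, title


def build_index(conn, paths):
    """Create a small lookup table and fill it from file paths alone."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songs (artist TEXT, album TEXT, title TEXT)"
    )
    # Parameterized inserts avoid any manual quote escaping.
    conn.executemany("INSERT INTO songs VALUES (?, ?, ?)", rows_from_paths(paths))
    conn.commit()


def find(conn, artist, title):
    """Case-insensitive exact lookup by artist and title."""
    cur = conn.execute(
        "SELECT artist, album, title FROM songs "
        "WHERE artist = ? COLLATE NOCASE AND title = ? COLLATE NOCASE",
        (artist, title),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, ["./O/Oleander/Joyride/Runaway Train", "./O/Oleander"])
    print(find(conn, "oleander", "runaway train"))
```

Exact matching like this is only a starting point; it does nothing for spelling variations or marks such as "(live in <place>)".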
It's not that bad with modern filesystems (see https://unix.stackexchange.com/questions/28756/what-is-the-most-high-performance-linux-filesystem-for-storing-a-lot-of-small-fi), though a database should indeed be more suitable for search. I like the idea of using the database locally, but here are a few things to consider regarding an online service's API:
Putting all those together, the API could be entirely HTTP-based, double as a web interface (and maybe a SPARQL endpoint), and consist only of search with optional parameters, but with a few different representations (alternate versions) so that it's easy to use from different music players. I'm not sure which DBMS to use, though. |
A bit more on DBMSes: querying with SPARQL or SQL would also let clients easily switch between a remote HTTP server and a local database (using the same queries). librdf supports file-based storage modules (including an SQLite database, http://librdf.org/docs/api/redland-storage-modules.html), and can be used to read RDFa data from web pages. Combined with the music ontology, this may let us avoid duplicating and reinventing a few things, while still letting us distribute the database as a single portable file. |
@defanor do you want to take a stab at creating the API? |
I can at least implement a prototype, but first we need to figure out what the interface itself should be like. For instance, while I like RDF, it may be rather obscure, and while using SPARQL directly for querying has its advantages, we'll be stuck with it then (which is perhaps worse than being stuck with SQL or with arbitrary URL query parameters). Another issue with SPARQL is the lack of quick text search: it should be fine if we match on artist name first (aliases can be used easily, and then there will be just a few song titles to check with I've looked up Debian package installation statistics (https://popcon.debian.org/main/by_inst) now, to get a rough estimate of which DBMS libraries are already available on client machines (for local database access), here they are:
So, while libsqlite3 is installed almost everywhere (and wouldn't be a burden as a dependency), librdf seems to be already present on more than a half of the systems (similar to Apparently we'd have to choose which of the listed interface and implementation properties to go after. |
I forgot: xapian has a query parser that will probably be usable on its own. So we could achieve the same as with SPARQL or SQL: the same queries for local lookups and for requests to an online service. Though then it'd be full-text search, and while it's customizable, I'm not sure if it can be told to just ignore (or assign a low weight to) any "(live in <place>)" marks. Update: using the same queries isn't that important, though, especially if it won't be SPARQL: xapian's query language is rather far from any standard for data querying (neither will the field names be standard, while with the music ontology and SPARQL the whole API could be described just as "a SPARQL endpoint serving the music ontology data"), and querying this database doesn't have to be flexible anyway. |
This is the first time I've heard about SPARQL/RDF and I had to look it up. It could be a better idea to base it on something more common like SQL, but I don't really have much of an idea about the intricacies of either. What are we actually solving here local-wise? Are there music players that are capable, without plugins/modifications, of querying local files/DB for lyrics and other metadata? You're saying "same queries for local lookups and requests to an online service." which I imagine leads to a 'no'. Sorry for the basic questions, I just have no idea about the state of music players. Btw http://musicontology.com is down right now, and Google cache doesn't seem to load, so it might be a good idea to post a copy of it in here when it's back up. |
Yes, as mentioned above, it's quite obscure. Though librdf is installed surprisingly commonly, it seems, and clients don't really have to master it in order to issue typical queries. But there are other reasons why it may not be suitable.
I'm not aware of such, but what I think would be nice for plugins/modifications is to be easy to write, and to require minimal additional dependencies (preferably nothing large, and something that's usually already installed). That should increase chances of local database adoption (not just the API). Maybe they shouldn't be mixed together at all, but it would be nice (easier) to maintain a single database/schema for both, and the same functionality is needed for both. |
To summarize DBMS/storage/search options:
So, there are distinct groups: the ones with good (flexible and indexed) search (which we don't really need yet, but hopefully the project will grow enough to need it someday), and the ones that are relatively easy to embed into arbitrary software (which we don't need right away either, but also aim for). Full-text search is only needed for search by song name alone, which is only required on the website. Otherwise strict matching on artist name/aliases should narrow the records sufficiently to just go through them (somehow, not sure how exactly) without a fancy index, although FTS could still be useful for finding the closest match among available records. As an alternative to full-text search, we could consider some basic custom preprocessing (e.g., dropping everything after the first opening paren, lower-casing the titles, etc.), but it's likely to be quite unreliable. FTS wouldn't be perfect either, and I guess it's not realistic to set aliases for every song title variation. |
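The basic preprocessing mentioned above could look roughly like the following sketch; the function name is hypothetical, and it implements only the two steps named (cutting at the first opening paren and lower-casing).

```python
def normalize_title(title):
    """Lower-case a title and drop everything from the first "(" onward."""
    head = title.split("(", 1)[0]
    # Collapse whitespace left behind by the cut.
    return " ".join(head.lower().split())


if __name__ == "__main__":
    print(normalize_title("Runaway Train (live in London)"))
    print(repr(normalize_title("(Intro) Song")))
```

The second call shows why this is unreliable: a title that starts with a parenthesized part is reduced to an empty string.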
Turns out FTS5 (the full-text search extension) has been included in SQLite since version 3.9. It's not in CentOS yet, but it's already in Debian stable, and likely in other common distributions. So, SQLite looks like a pretty good option. |
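A minimal sketch of what an FTS5-backed lookup could look like, assuming an SQLite build with FTS5 enabled. The three-column layout, the function names, and the sample rows are made up for illustration; the query requires the artist and only ranks by title terms, so that extra marks in a title don't rule a match out.

```python
import sqlite3


def open_index(rows):
    """Build an in-memory FTS5 table holding only the search fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE lyrics USING fts5(artist, album, title)")
    conn.executemany("INSERT INTO lyrics VALUES (?, ?, ?)", rows)
    return conn


def search(conn, artist, title_terms, limit=10):
    """Require the artist, then rank rows by the title terms they contain."""
    def quote(s):
        # FTS5 strings are double-quoted; embedded quotes are doubled.
        return '"' + s.replace('"', '""') + '"'

    terms = " OR ".join(quote(t) for t in title_terms)
    match = f"artist : {quote(artist)} AND ({terms})"
    cur = conn.execute(
        "SELECT artist, album, title FROM lyrics "
        "WHERE lyrics MATCH ? ORDER BY rank LIMIT ?",
        (match, limit),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = open_index([
        ("Oleander", "Joyride", "Runaway Train"),
        ("Oleander", "Joyride", "Placeholder Song"),    # made-up row
        ("Other Artist", "Some Album", "Runaway Train"),  # made-up row
    ])
    print(search(conn, "oleander", ["runaway", "train", "live"]))
```

Joining the title terms with OR rather than AND means a term like "live" that is missing from the stored title lowers nothing and excludes nothing.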
Trying it now, and apparently a bit of query preparation (and/or preprocessing) will be needed anyway: we can't just require all the terms (because of those marks), and simply using Going to fill a database with lyrics from files tomorrow, then we could poke it more, see how it works, and perhaps proceed to a website/API then. |
Test schema (just search fields, no lyrics):
BEGIN { FS = "/" }   # split each path on "/", starting with the first record
{
  if ($4 != "") {
    gsub(/'/, "''");   # double single quotes so they are valid inside SQL strings
    printf "INSERT INTO lyrics VALUES ('%s', '%s', '%s');\n", $2, $3, $4
  }
}
Can be invoked with We'd probably need much more data, checked against actual playlists, to be sure, but so far it seems to work fine with queries akin to the one above (e.g., |
The "initial token queries" should be useful for longest prefix search, but they don't seem to be supported in sqlite 3.16.2 (the one in Debian stable). Regarding further steps: we'll need to decide which language(s) to use for the website/API, as well as for the database filling tool. I'm thinking of Perl (perhaps with CGI, and with XSLT for templating): not that I like it or have used it in the past few years, but it's commonly available, and should be easily editable by everyone. It's also less awkward and error-prone than longer shell scripts. I'm open to suggestions, though. |
@defanor if you'd like I can spin you up a container with Arch (sqlite 3.26.0) with SSH and HTTP/HTTPS access. |
Thanks, but no need (at least not until we have something to deploy):
I've just built 3.26 for development here, and apart from "disconnect",
it seems to work with the old Perl module. And I could always run a
local VM, but it's more comfortable without those.
I'm going to poke relevant Perl modules today and tomorrow (sqlite and
libxml/libxslt bindings, CGI), and see how it goes.
|
Wrote a rough script to fill an sqlite database with lyrics, it's 4 MB with 2488 texts (with Also need to decide whether to put those into the "lyrics" repository, or a separate one. @vflyson, what do you think? |
For different forms themselves, I think a lot of it would get covered by converting everything down to the base character like so, so songs with weird unicode(ish) characters would have two entries, one with the accents, one without. However at least in PHP this was extremely inefficient to do (not that I tried to optimize it)
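A minimal Python sketch of this kind of base-character folding (not the PHP code referred to above): it decomposes each character with Unicode NFKD normalization and drops the combining marks. The function name is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Return `text` with combining marks removed (e.g. "ä" becomes "a")."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_accents("Abwärts"), fold_accents("Dödelhaie"))
```

Storing both the original and the folded form gives the two entries described above; this only handles characters that decompose into a base letter plus marks, not transliteration in general.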
|
FTS5 takes care of that (tried it with Abwärts/Abwarts,
Dödelhaie/Dodelhaie), but I'm more worried about the kinds of different
spelling that can't be dealt with using just character substitution: for
instance, some bands change names, but tags don't necessarily reflect
those properly. Some may also be commonly abbreviated. And
transliteration between some languages is more tricky than that.
Meantime, I've composed a basic search script that expands queries to
look for the longest prefix, and perhaps will proceed to the web bits
next. Most likely will have to get back to search later (tweak the
queries themselves and/or preprocessing), but it would be nice to have a
working API prototype.
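One possible reading of "expanding queries to look for the longest prefix" is sketched below: generate phrase queries for ever shorter token prefixes of the title and try them in order until one returns results. This is an assumption about the approach, not the actual script, and the function name is hypothetical.

```python
import re


def prefix_queries(title):
    """Return quoted phrase queries for ever shorter token prefixes of `title`."""
    tokens = re.findall(r"\w+", title.lower())
    return [
        '"' + " ".join(tokens[:n]) + '"'
        for n in range(len(tokens), 0, -1)
    ]


if __name__ == "__main__":
    for query in prefix_queries("Runaway Train (live in London)"):
        print(query)
```

A caller would run each query against the title field and stop at the first one that matches, so "Runaway Train (live in London)" falls back to "runaway train" once the longer phrases find nothing.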
|
Created the https://github.com/Lyrics/lyrics-api repository, perhaps we could continue the API/DB discussion in issues there. The initial CGI script is ready (though didn't push yet, trying to figure which license to use: AGPL seems appropriate for web-facing software, LGPL/BSD/MIT would be more suitable for libraries if we'll have any, or if any would be based on the CGI script), currently it just serves up to 10 best matches. But probably should serve listings if no song title at all is provided, and/or if there's a bunch of results with roughly the same rankings. Well, there's many small things like that, will leave them for issues in that repository. |
Can just release with the more 'draconic' licenses and always relicense to different ones later if needed. |
Indeed. Licensed under AGPL and pushed.
|
The database has been updated to contain basic metadata for all lyrics. An automated test to ensure this doesn't become an issue again in the future is in the making. |
The aforementioned automated test is in place and lyrics.git now has Travis CI hooked up to it to check every commit to ensure the database retains its consistency. |
1. Search is lacking
At the moment I can't even search by title or album, much less both via some API.
There's Oleander
https://lyrics.github.io/db/o/oleander/
with album Joyride
https://lyrics.github.io/db/o/oleander/joyride/
with a song Runaway train
https://lyrics.github.io/db/o/oleander/joyride/runaway-train/
Yet the only working search from these is the last one.
https://lyrics.github.io/search.html?q=oleander
https://lyrics.github.io/search.html?q=joyride
https://lyrics.github.io/search.html?q=runaway%20train
2. Search API
For the API part, I'd for example like to have the lyrics show up in Deadbeef. For that there'd need to be a plugin that'd work via an API. There are plugins for sites like lyrics.wikia.com, and I can see they use an API that searches for both artist and the song.
https://github.com/loskutov/deadbeef-lyricbar/blob/master/src/utils.cpp#L25
No API is a show stopper IMO, this project would be infinitely more useful if things like music players could rely on it.
3. Translations
Looks like the /Lyrics repo has a translations folder but it doesn't seem to be supported at all by the site?