Folia text validation on corrections #75
This is a complicated issue that we need to address properly, so I'll write it out in full, also to get a clear picture of it myself. I think we're close to one of FoLiA's boundaries here: in principle FoLiA does not support multiple tokenisations of the same text. A text has one tokenisation, and all further linguistic annotations that are based on tokens use the same ones. This is a deliberate limitation/simplification, as things get complicated and messy if there are multiple, often conflicting, tokenisations. (Other formats may do this by letting each linguistic annotation explicitly refer to character offsets in the original text, essentially letting each annotation layer define its own 'tokens', if I can put it like that.) Having said that, there is the
Second, we have text redundancy in FoLiA, i.e. it is possible to express text on multiple levels (e.g. sentence level and word level). If there is text on multiple levels, our text consistency rule applies, and it is checked by our libraries (as we notice in this issue): it enforces that text at a higher level must always be consistent with the text at a deeper level. Text at a higher level than the token level is by definition untokenised.

A third feature of FoLiA is that we support multiple text layers. Instead of just a single text reading of a certain structural element, there may be multiple. Consider a text layer right after an OCR system, and one after normalisation. Multiple text layers are each identified by their own class (if the class is omitted, which it almost always is if only one text layer exists, the default class "current" is assigned). The fact that it's called current alludes to it being the most current reading (as opposed to something that is corrected), so it still carries some special meaning (one of the rare exceptions in FoLiA, since we never predefine any other classes).

In this issue, we see all three of these features combined, and a potential conflict lurks in the woods: we can express multiple text layers at the higher level (sentence), but we cannot express multiple tokenisations for each of those text layers, only one. I understand your goal is to relate the two text levels, or the tokens therein, to each other.

When it comes to corrections, the "current" text class still triggers some special behaviour. When you reverse your example by using the "current" class instead of "out", and something like "in" for the original pre-corrected text, then things already look better:
<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
<metadata type="native">
<annotations>
<correction-annotation />
<text-annotation />
<sentence-annotation />
<token-annotation />
</annotations>
</metadata>
<text xml:id="bug">
<s xml:id="s.1">
<t class="in">Dit is een test</t>
<t>Dit is test</t>
<w xml:id="w.1">
<t class="in">Dit</t>
<t>Dit</t>
</w>
<w xml:id="w.2">
<t class="in">is</t>
<t>is</t>
</w>
<correction>
<original>
<w xml:id="w.3">
<t class="in">een</t>
</w>
</original>
<new/> <!-- you didn't have this, not required as it was assumed but I'd rather make it explicit -->
</correction>
<w xml:id="w.4">
<t class="in">test</t>
<t>test</t>
</w>
</s>
</text>
</FoLiA>
So, like you intended, in this situation we have a token w.3 that only exists in the original input (it doesn't
foliavalidator accepts it, but not out of wisdom: it also accepts it if I change the text of w.3 for textclass "in" to "geen", which creates inconsistent text. The reason is that I simply don't do proper text validation when the situation gets overly complex (see https://github.com/proycon/foliapy/blob/master/folia/main.py#L1285). In this case things are still unambiguous, though; only when corrections get nested does it become truly impossible to resolve the original text (as there is no longer a single one).
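To make the rule concrete for this unambiguous, non-nested case, here is a minimal sketch of a per-class consistency check. It is not how foliapy or libfolia implement validation; the function names are hypothetical, it assumes a single space as the token delimiter, and it simply collects every `<w>` that carries text of the requested class, whether or not it sits inside a correction.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"  # FoLiA namespace used in the examples

def text_of(element, textclass):
    """Return the text of the direct <t> child with this class, or None."""
    for t in element.findall(NS + "t"):
        # A <t> without a class attribute belongs to the class "current".
        if t.get("class", "current") == textclass:
            return t.text or ""
    return None

def token_text(sentence, textclass, delimiter=" "):
    """Join all word texts of one class, in document order.

    Words inside <original> or <new> are included too (naive assumption).
    """
    parts = []
    for w in sentence.iter(NS + "w"):
        text = text_of(w, textclass)
        if text is not None:
            parts.append(text)
    return delimiter.join(parts)

def is_consistent(sentence, textclass):
    """True if the sentence text of this class matches its tokens."""
    expected = text_of(sentence, textclass)
    return expected is None or expected == token_text(sentence, textclass)

SAMPLE = """<s xmlns="http://ilk.uvt.nl/folia" xml:id="s.1">
  <t class="in">Dit is een test</t>
  <t>Dit is test</t>
  <w xml:id="w.1"><t class="in">Dit</t><t>Dit</t></w>
  <w xml:id="w.2"><t class="in">is</t><t>is</t></w>
  <correction>
    <original><w xml:id="w.3"><t class="in">een</t></w></original>
    <new/>
  </correction>
  <w xml:id="w.4"><t class="in">test</t><t>test</t></w>
</s>"""

sentence = ET.fromstring(SAMPLE)
for cls in ("in", "current"):
    print(cls, is_consistent(sentence, cls))
```

With this check, changing the "in" text of w.3 to "geen" makes the "in" layer inconsistent, while the "current" layer is unaffected. Nested corrections are deliberately out of scope here.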
folialint seems to stumble on a text delimiter issue (the space after w.2 seems to be forgotten because there is a correction in between, but technically that is a different bug from the one we are discussing, and it looks as if the consistency check would pass too if that were fixed):

But I would say this example is indeed valid, as none of our rules are broken. We have one tokenisation (without w.3), I also worked out a somewhat more sensible example that might be closer to a real use case and should be valid, but it essentially does the
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
<metadata type="native">
<annotations>
<correction-annotation />
<text-annotation />
<sentence-annotation />
<token-annotation />
</annotations>
</metadata>
<text xml:id="issue75a">
<s xml:id="s.1">
<t>Dit is een test</t>
<t class="ocr">D!t 1S een @#~ tezt</t>
<w xml:id="w.1">
<t>Dit</t>
<t class="ocr">D!t</t>
</w>
<w xml:id="w.2">
<t>is</t>
<t class="ocr">1S</t>
</w>
<w xml:id="w.3">
<t>een</t>
<t class="ocr">een</t>
</w>
<correction>
<original>
<w xml:id="w.coffeestain">
<t class="ocr">@#~</t>
</w>
</original>
<new/>
</correction>
<w xml:id="w.4">
<t>test</t>
<t class="ocr">tezt</t>
</w>
</s>
</text>
</FoLiA>
So, is the problem solved? Not sure yet. Let's try an insertion as well:
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
<metadata type="native">
<annotations>
<correction-annotation />
<text-annotation />
<sentence-annotation />
<token-annotation />
</annotations>
</metadata>
<text xml:id="issue75a">
<s xml:id="s.1">
<t>Dit is een test</t>
<t class="ocr">D!t 1S tezt</t>
<w xml:id="w.1">
<t>Dit</t>
<t class="ocr">D!t</t>
</w>
<w xml:id="w.2">
<t>is</t>
<t class="ocr">1S</t>
</w>
<correction>
<original/>
<new>
<w xml:id="w.3">
<t>een</t>
</w>
</new>
</correction>
<w xml:id="w.4">
<t>test</t>
<t class="ocr">tezt</t>
</w>
</s>
</text>
</FoLiA>
Both validators accept it, and I would say it's valid too. So that probably solves this issue and your solution with Like If you really want to avoid the complexity of
<s xml:id="s.1">
<t>Dit is een test</t>
<t class="ocr">D!t 1S tezt</t>
<str xml:id="str.1">
<t offset="0">Dit</t>
<t offset="0" class="ocr">D!t</t>
</str>
<str xml:id="str.2">
<t offset="4">is</t>
<t offset="4" class="ocr">1S</t>
</str>
<str xml:id="str.4">
<t offset="11">test</t>
<t offset="7" class="ocr">tezt</t>
</str>
<str xml:id="str.3"> <!-- I'm deliberately messing with the ordering here to emphasise that it has no meaning with strings-->
<t offset="7">een</t>
</str>
<!-- and below an extra string example to emphasise that strings are not tokens: this overlaps with str.1 and str.2 -->
<str xml:id="str.bonus">
<t offset="2">t is</t>
<t offset="2" class="ocr">t 1S</t>
</str>
</s> |
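Since the offsets in such `<str>` elements are easy to get wrong by hand, a small check can help. The following is a rough sketch, not part of any FoLiA library: the function names are hypothetical, and it assumes offsets are zero-based character positions into the sentence text of the same class, as in the example above. The sample contains one deliberately wrong offset (`str.wrong`, invented for illustration) to show what gets flagged.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def texts_by_class(element):
    """Map each text class to the direct <t> child carrying it."""
    return {t.get("class", "current"): t for t in element.findall(NS + "t")}

def bad_offsets(sentence):
    """Return (string id, class) pairs whose offset does not match."""
    parent = {cls: t.text or "" for cls, t in texts_by_class(sentence).items()}
    problems = []
    for string in sentence.findall(NS + "str"):
        for cls, t in texts_by_class(string).items():
            if t.get("offset") is None:
                continue  # no offset given, nothing to check
            offset = int(t.get("offset"))
            text = t.text or ""
            reference = parent.get(cls)
            # The substring of the sentence text at the offset must equal the string text.
            if reference is None or reference[offset:offset + len(text)] != text:
                problems.append((string.get(XML_ID), cls))
    return problems

SAMPLE = """<s xmlns="http://ilk.uvt.nl/folia" xml:id="s.1">
  <t>Dit is een test</t>
  <t class="ocr">D!t 1S tezt</t>
  <str xml:id="str.1">
    <t offset="0">Dit</t>
    <t offset="0" class="ocr">D!t</t>
  </str>
  <str xml:id="str.4">
    <t offset="11">test</t>
    <t offset="7" class="ocr">tezt</t>
  </str>
  <str xml:id="str.wrong">
    <t offset="8">een</t>
  </str>
</s>"""

problems = bad_offsets(ET.fromstring(SAMPLE))
print(problems)
```

Because strings are not tokens, the check looks at each string independently and ignores ordering and overlap.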
Thanks for the long story.
|
The fact that "current" is a predefined class, unlike any other in FoLiA, has always been a bit inelegant indeed. Some mechanism to let the user determine the name of the most current class and/or the default class might indeed be a nice enhancement. The best place for that would be the declarations block, I think; it can probably be done with an XML attribute. Still, even without that, you should be able to get by in the current situation by simply renaming the classes.
Agreed, I think TICCL performs a role similar to (but more complex than) that of a tokeniser and as such should produce tokens. |
I'm doing some initial thinking on this, and I'd say adding "defaulttextclass" and "currenttextclass" attributes to the <text-annotation set="..." defaulttextclass="current" currenttextclass="current" /> I'm deliberately splitting the default and current notions for extra flexibility. The above values would also be the defaults if the attributes are not set explicitly. |
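As an illustration of how a reader of the format could honour this proposal, here is a minimal sketch. The two attributes are only proposed in this thread and are not an existing FoLiA feature as far as this discussion shows; the function names are hypothetical, and the sketch follows only the defaults stated above (both fall back to "current").

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

def text_class_settings(root):
    """Return (defaulttextclass, currenttextclass) from the declaration.

    Both attributes are proposals from this thread; each falls back to
    "current" when absent, so older documents behave as before.
    """
    decl = next(root.iter(NS + "text-annotation"), None)
    if decl is None:
        return "current", "current"
    return (decl.get("defaulttextclass", "current"),
            decl.get("currenttextclass", "current"))

def effective_class(t_element, default_class):
    """Class of a <t> element, using the declared default when omitted."""
    return t_element.get("class", default_class)

SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc" version="2.2">
  <metadata type="native">
    <annotations>
      <text-annotation defaulttextclass="ocr" currenttextclass="current" />
    </annotations>
  </metadata>
  <text xml:id="example">
    <s xml:id="s.1">
      <t>D!t 1S tezt</t>
    </s>
  </text>
</FoLiA>"""

root = ET.fromstring(SAMPLE)
default_class, current_class = text_class_settings(root)
first_t = next(root.iter(NS + "t"))
print(default_class, current_class, effective_class(first_t, default_class))
```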
Or perhaps |
This sounds feasible. Having "current" as the default means that older documents are still valid (or weren't in the first place). |
And I assume that the default for defaultclass is currentclass (or vice versa) |
yes |
Ok, so it seems good to implement the |
Given this FoLiA, which contains a deletion correction,
foliavalidator gives this result:
folialint also doesn't like this:
So folialint chokes on the 'current' textclass,
and foliavalidator on the 'none' class. Probably on 'current' too, although the text seems to belong to 'out'.
Anyway, lots of trouble on both sides...