Implement Unicode Case Folding #229
Changes from 4 commits
b376e37
f0bd02f
cfebcb0
9ca6c36
98f82e7
6ed8433
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
package org.typelevel.ci | ||
package bench | ||
|
||
import org.scalacheck._ | ||
import org.typelevel.ci.testing.arbitraries._ | ||
import cats._ | ||
import org.openjdk.jmh.annotations._ | ||
import java.util.concurrent.TimeUnit | ||
|
||
@State(Scope.Thread) | ||
@BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime)) | ||
@OutputTimeUnit(TimeUnit.MILLISECONDS) | ||
class CaseFoldedStringBench { | ||
|
||
var currentSeed: Long = Long.MinValue | ||
|
||
def nextSeed: Long = { | ||
val seed = currentSeed | ||
currentSeed += 1L | ||
seed | ||
} | ||
|
||
def nextString: String = | ||
Arbitrary.arbitrary[String].apply(Gen.Parameters.default, rng.Seed(nextSeed)).getOrElse(throw new AssertionError("Failed to generate String.")) | ||
|
||
def nextListOfString: List[String] = | ||
Gen.listOf(Arbitrary.arbitrary[String])(Gen.Parameters.default, rng.Seed(nextSeed)).getOrElse(throw new AssertionError("Failed to generate List[String].")) | ||
|
||
@Benchmark | ||
def caseFoldedStringHash: Int = | ||
CaseFoldedString(nextString).hashCode | ||
|
||
@Benchmark | ||
def caseFoldedStringFoldMap: CaseFoldedString = | ||
Foldable[List].foldMap(nextListOfString)(CaseFoldedString.apply) | ||
|
||
@Benchmark | ||
def stringHash: Int = | ||
nextString.hashCode | ||
|
||
@Benchmark | ||
def stringFoldMap: String = | ||
Foldable[List].foldMap(nextListOfString)(identity) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
package org.typelevel.ci | ||
|
||
import cats._ | ||
import cats.kernel.LowerBounded | ||
import org.typelevel.ci.compat._ | ||
import scala.annotation.tailrec | ||
|
||
/** A case folded `String`. This is a `String` which has been converted into a | ||
* state which is suitable for case insensitive matching under the Unicode | ||
* standard. | ||
* | ||
* This type differs from [[CIString]] in that it does ''not'' retain the | ||
* original input `String` value. That is, this is a destructive | ||
* transformation. You should use [[CaseFoldedString]] instead of | ||
* [[CIString]] when you only want the case insensitive `String` and you | ||
* never want to return the `String` back into the input value. In such cases | ||
* [[CaseFoldedString]] will be more efficient than [[CIString]] as it only | ||
* has to keep around a single `String` in memory. | ||
* | ||
* Case insensitive `String` values under Unicode are not always intuitive, | ||
* especially on the JVM. There are three character cases to consider, lower | ||
* case, upper case, and title case, and not all Unicode codePoints have all | ||
* 3, some only have 2, some only 1. For some codePoints, the JRE standard | ||
* operations don't always work as you'd expect. | ||
* | ||
* {{{ | ||
* scala> val codePoint: Int = 8093 | ||
* val codePoint: Int = 8093 | ||
* | ||
* scala> new String(Character.toChars(codePoint)) | ||
* val res0: String = ᾝ | ||
* | ||
* scala> res0.toUpperCase | ||
* val res1: String = ἭΙ | ||
* | ||
* scala> res0.toUpperCase.toLowerCase == res0.toLowerCase | ||

These methods are dangerous without an explicit Locale.

Yeah, that's part of why my #232 implementation intentionally bypasses any JRE based string conversion and java.util.Locale stuff. For example, from my reading on the Turkic rules around

Oh, I was just pointing out that calling these methods without an explicit Locale is setting a bad example for others. I wish those overloads were deprecated.

Ah, I see. I'll make sure anything like that is updated/removed in the final version.

* val res2: Boolean = false | ||
* | ||
* scala> Character.getName(res0.head) | ||
* val res3: String = GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI | ||
* | ||
* scala> res0.toUpperCase.toLowerCase.equalsIgnoreCase(res0.toLowerCase) | ||
* val res4: Boolean = false | ||
* }}} | ||
* | ||
* In this example, given the Unicode character \u1f9d, converting it to | ||
* upper case, then to lower case, is not equal under normal String | ||
* equality. `String.equalsIgnoreCase` also does not work correctly by the | ||
* Unicode standard. | ||
* | ||
* Making matters more complicated, for certain Turkic languages, the case | ||
* folding rules change. See the Unicode standard for a full discussion of | ||
* the topic. | ||
* | ||
* @note For most `String` values the `toString` form of this is lower case | ||
* (when the given character has more than one case), but this is not | ||
* always the case. Certain Unicode scripts have exceptions to this and | ||
* will be case folded into upper case. If you want/need an only lower | ||
* case `String`, you should call `.toString.toLowerCase`. | ||
* | ||
* @see [[https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G21790]] | ||
*/ | ||
final case class CaseFoldedString private (override val toString: String) extends AnyVal { | ||
|
||
def isEmpty: Boolean = toString.isEmpty | ||
|
||
def nonEmpty: Boolean = !isEmpty | ||
|
||
def length: Int = toString.length | ||
|
||
def size: Int = length | ||
|
||
def trim: CaseFoldedString = | ||
CaseFoldedString(toString.trim) | ||
|
||
private final def copy(toString: String): CaseFoldedString = | ||
CaseFoldedString(toString) | ||
} | ||
|
||
object CaseFoldedString { | ||
|
||
/** Create a [[CaseFoldedString]] from a `String`. | ||
* | ||
* @param turkicFoldingRules if `true`, use the case folding rules | ||
* applicable to some Turkic languages. | ||
*/ | ||
def apply(value: String, turkicFoldingRules: Boolean): CaseFoldedString = { | ||

How about just calling this one

I know I've created a bit of a mess by creating this and #232, but in that one I create separate types for Turkic and non-Turkic variants. I think if we support Turkic variants, we need to have separate types. I'm concerned that allowing two

I think they would need to be separate types. I guess we can support them, but it's locale-like behavior, and I'm not sure where it ends.

Well...if we want to support Simple/Full Default/Canonical/Compatibility caseless matching, I feel like the Turkic variants aren't really the primary issue. Unfortunately, it sounds like there are quite real use cases for most of these permutations 🙁

val builder: java.lang.StringBuilder = new java.lang.StringBuilder(value.length * 3) | ||
val foldCodePoint: Int => Array[Int] = | ||
if (turkicFoldingRules) { | ||
CaseFolds.turkicFullCaseFoldedCodePoints | ||
} else { | ||
CaseFolds.fullCaseFoldedCodePoints | ||
} | ||
|
||
@tailrec | ||
def loop(index: Int): String = | ||
if (index >= value.length) { | ||
builder.toString | ||
} else { | ||
val codePoint: Int = value.codePointAt(index) | ||
foldCodePoint(codePoint).foreach(c => builder.appendCodePoint(c)) | ||
val inc: Int = if (codePoint >= 0x10000) 2 else 1 | ||
loop(index + inc) | ||
} | ||
|
||
new CaseFoldedString(loop(0)) | ||
} | ||
|
||
/** Create a [[CaseFoldedString]] from a `String`. | ||
* | ||
* @note This factory method does ''not'' use the Turkic case folding | ||
* rules. For the majority of languages this is the correct method of | ||
* case folding. If you know your `String` is specific to one of the | ||
* Turkic languages which use special case folding rules, you can use | ||
* the secondary factory method to enable case folding under those | ||
* rules. | ||
*/ | ||
def apply(value: String): CaseFoldedString = | ||
apply(value, false) | ||
|
||
val empty: CaseFoldedString = | ||
CaseFoldedString("") | ||
|
||
implicit val hashAndOrderForCaseFoldedString: Hash[CaseFoldedString] with Order[CaseFoldedString] = | ||
new Hash[CaseFoldedString] with Order[CaseFoldedString] { | ||
override def hash(x: CaseFoldedString): Int = | ||
x.hashCode | ||
|
||
override def compare(x: CaseFoldedString, y: CaseFoldedString): Int = | ||
x.toString.compare(y.toString) | ||
} | ||
|
||
implicit val orderingForCaseFoldedString: Ordering[CaseFoldedString] = | ||
hashAndOrderForCaseFoldedString.toOrdering | ||
|
||
implicit val showForCaseFoldedString: Show[CaseFoldedString] = | ||
Show.fromToString | ||
|
||
implicit val lowerBoundForCaseFoldedString: LowerBounded[CaseFoldedString] = | ||
new LowerBounded[CaseFoldedString] { | ||
override val partialOrder: PartialOrder[CaseFoldedString] = | ||
hashAndOrderForCaseFoldedString | ||
|
||
override val minBound: CaseFoldedString = | ||
empty | ||
} | ||
|
||
implicit val monoidForCaseFoldedString: Monoid[CaseFoldedString] = | ||
new Monoid[CaseFoldedString] { | ||
override val empty: CaseFoldedString = CaseFoldedString.empty | ||
|
||
override def combine(x: CaseFoldedString, y: CaseFoldedString): CaseFoldedString = | ||
new CaseFoldedString(x.toString + y.toString) | ||
|
||
override def combineAll(xs: IterableOnce[CaseFoldedString]): CaseFoldedString = { | ||
val sb: StringBuilder = new StringBuilder | ||
xs.iterator.foreach(cfs => sb.append(cfs.toString)) | ||
new CaseFoldedString(sb.toString) | ||
} | ||
} | ||
} |
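
The review note on the `toUpperCase`/`toLowerCase` calls in the Scaladoc example can be illustrated with a small sketch. It is not part of this PR: the object name `LocaleCaseExample` and its methods are hypothetical, and it only shows how the JRE conversions change with the `Locale` that is passed (or implied). It is not Unicode case folding.

```scala
import java.util.Locale

// Hypothetical helper, not part of the PR: shows why the Locale-free
// overloads of toUpperCase/toLowerCase are risky. They use the JVM's
// default Locale, so results can differ from machine to machine.
object LocaleCaseExample {

  // Locale.ROOT gives a locale-neutral conversion.
  def lowerRoot(s: String): String = s.toLowerCase(Locale.ROOT)

  // Under Turkish rules, 'I' lower-cases to dotless 'ı' (U+0131).
  private val turkish: Locale = Locale.forLanguageTag("tr")

  def lowerTurkish(s: String): String = s.toLowerCase(turkish)
}
```

With this sketch, `LocaleCaseExample.lowerRoot("TITLE")` yields `"title"`, while `LocaleCaseExample.lowerTurkish("TITLE")` yields `"t\u0131tle"`. A call to `"TITLE".toLowerCase` with no argument would give one or the other depending on the default `Locale` of the running JVM.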
This encoding eagerly folds Strings that may never have `equals` or `hashCode` called on them, and roughly doubles the memory per instance. The current encoding caches the hash code on first use, but inside a Map would require refolding the String to test equality.

I can imagine either strategy winning in some contexts. But what if `asCaseFoldedString` were lazy: we'd fold zero times when we can, once when we must. Lazy could either be a lazy val, or just a null-init pattern similar to the zero-init pattern already here on hashCode.

@rossabaker yeah, that makes sense. I actually benchmarked def/lazy val/val. lazy val seemed like a reasonable tradeoff. I was hesitant because I think a lot of our use cases are hashCode/equality heavy, but I can make it a `lazy val` or use the null-init pattern.

If you can stomach the var, null-init is the fastest unless multiple threads initialize it concurrently, which should be rare. This is how `String#hashCode` operates.
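
A minimal sketch of the null-init pattern discussed in this thread, under stated assumptions: the class `LazyFoldedExample`, its `asFolded` accessor, and the `foldCount` counter are hypothetical names for illustration, and `fold` is a stand-in that lower-cases with `Locale.ROOT` rather than the PR's real case folding.

```scala
import java.util.Locale

// Hypothetical wrapper, not the PR's CIString: keeps the original String
// and computes the folded form only when equality or hashing needs it.
final class LazyFoldedExample(val original: String) {

  // null until first use. Deliberately not volatile: if two threads race,
  // both compute the same immutable String and one write wins.
  private var folded: String = null

  // 0 until first use, the same zero-init idea as String#hashCode.
  private var hash: Int = 0

  // Only here to make the laziness observable in this sketch.
  var foldCount: Int = 0

  def asFolded: String = {
    var f = folded
    if (f == null) {
      foldCount += 1
      f = LazyFoldedExample.fold(original)
      folded = f
    }
    f
  }

  override def hashCode: Int = {
    var h = hash
    if (h == 0) {
      h = asFolded.hashCode
      hash = h
    }
    h
  }

  override def equals(other: Any): Boolean = other match {
    case that: LazyFoldedExample => this.asFolded == that.asFolded
    case _                       => false
  }
}

object LazyFoldedExample {
  // Stand-in for real Unicode case folding (an assumption of this sketch).
  def fold(s: String): String = s.toLowerCase(Locale.ROOT)
}
```

An instance that is never compared or hashed folds zero times. The first `equals` or `hashCode` call folds once, and later calls reuse the cached value. A `lazy val` would give the same observable behavior with synchronization on initialization, which is the tradeoff the thread weighs against the plain `var`.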