Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support the new minhash 25.02 api #445

Merged

Conversation

praateekmahajan
Copy link
Collaborator

@praateekmahajan praateekmahajan commented Dec 20, 2024

Description

Solves #426 and piggy backs on https://github.com/NVIDIA/NeMo-Curator/pull/442/files

Version API Speed / Implementation
24.10 minhash(a) Slow (Old)
24.12 minhash_permuted(a,b) Fast (new)
25.02 minhash(a,b) Fast (new)

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <praateekm@gmail.com>
@praateekmahajan praateekmahajan added the gpuci Run GPU CI/CD on PR label Dec 20, 2024
Comment on lines -42 to +50
# TODO remove this once 24.12.0 becomes the base version of cudf in nemo-curator
MINHASH_PERMUTED_AVAILABLE = CURRENT_CUDF_VERSION >= parse_version("24.12.0") or (
CURRENT_CUDF_VERSION.is_prerelease
and CURRENT_CUDF_VERSION.base_version >= "24.12.0"
# TODO remove this once 25.02 becomes the base version of cudf in nemo-curator

# minhash in < 24.12 used to have a minhash(txt) api which was deprecated in favor of
# minhash(a, b) in 25.02 (in 24.12, minhash_permuted(a,b) was introduced)
MINHASH_DEPRECATED_API = (
CURRENT_CUDF_VERSION.base_version < parse_version("24.12").base_version
)
MINHASH_PERMUTED_AVAILABLE = (CURRENT_CUDF_VERSION.major == 24) & (
CURRENT_CUDF_VERSION.minor == 12
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait I'm confused, why are we doing it for 24.10 and 24.12 instead of for 24.12 and 25.02?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the code block here helps

if MINHASH_DEPRECATED_API:
     # this is 24.10, the variable is named minhash deprecated api, 
     # because now cudf only needs four args, rather than two args
    return ser.str.minhash(seeds=seeds, width=char_ngram)
else:
    if MINHASH_PERMUTED_AVAILABLE: 
        # this is 24.12 because in this version the four arg function is called minhash_permuted
        return ser.str.minhash_permuted(
            a=seeds_a, b=seeds_b, seed=seeds[0][0], width=char_ngram
        )
    else:
        # this is the 25.02 case where it's neither the deprecated case neither permuted is available
        return ser.str.minhash(a=seeds_a, b=seeds_b, seed=seeds[0][0], width=char_ngram)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok I see, thanks! So are we waiting until the next NeMo Curator release (which will have 24.12 as the stable RAPIDS version) before removing MINHASH_DEPRECATED_API?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct! And 25.02+ we wouldn't need any of these variables

Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! I think you just need to update this branch, then feel free to merge.

@praateekmahajan praateekmahajan added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Dec 30, 2024
@praateekmahajan praateekmahajan merged commit d401333 into NVIDIA:main Dec 30, 2024
5 checks passed
sarahyurick added a commit to sarahyurick/NeMo-Curator that referenced this pull request Jan 3, 2025
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
sarahyurick added a commit that referenced this pull request Jan 17, 2025
* add changes from #389

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add scripts files

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add changes from #326

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run black

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* re add ParallelScoreFilter

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove _MapBuckets and _Shuffle from nemo_curator path

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* update api doc

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add changes from #445

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* Add changes from #478

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* final nits

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gpuci Run GPU CI/CD on PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants