Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Consolidate CLI #650

Open
wants to merge 28 commits into
base: main
Choose a base branch
from
Open

feat: Consolidate CLI #650

wants to merge 28 commits into from

Conversation

tedil
Copy link
Contributor

@tedil tedil commented Dec 17, 2024

Harmonizes CLI between annotate seqvars, annotate strucvars and server, though annotate seqvars is the only one to actually make use of all of the options right now.

For server, we could add endpoints for frequencies and clinvar as well, while we're at it; otherwise there's no point in adding the --frequencies and --clinvar options.

annotate strucvars actually does not use any of the provided databases at the moment, since it mainly performs broad consequence predictions (transcript_variant, exon_variant, splice_region_variant, intron_variant, (up|down)stream_variant, intergenic_variant + some seqvars predictions for the breakpoint if the svtype is ins or bnd) (correct me if I'm wrong @holtgrewe), where transcripts usually are of no use.
So: should we keep these options for "the future" or just get rid of them for strucvars altogether? The old path_db wasn't used anywhere either.

Current state summary:

  • annotate seqvars and server run now use the same options (--transcripts, --frequencies, --clinvar, as well as transcript picking configurations)
  • annotate strucvars no longer expects the path-db argument, since that was unused, but also does not accept any of the options mentioned above
  • the genome release association is either derived from the metadata of the transcript db (for the --transcripts options) or from the file path in case of --frequencies and --clinvar (which expects rocksdb folders, this needs to be documented). This is a bit ugly.
  • server run now also exposes seqvars/frequency and seqvars/clinvar endpoints

Summary by CodeRabbit

Summary by CodeRabbit

Release Notes

  • New Features

    • Added new command-line interface for configuring transcript and annotation settings.
    • Introduced new API endpoints for ClinVar and frequency variant annotations.
    • Enhanced server-side variant query capabilities with genome release support.
  • Improvements

    • Restructured annotation configuration for more flexible source management.
    • Updated API with versioned endpoints (/api/v1/).
    • Added support for dynamic genome release handling.
  • Changes

    • Refactored structural variant annotation command-line arguments.
    • Updated server data structures to support new annotation types.
    • Added a new field for strand information in response structures.

Copy link
Contributor

coderabbitai bot commented Dec 17, 2024

Warning

Rate limit exceeded

@tedil has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 23 minutes and 3 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between f1c010f and 5fd01b5.

📒 Files selected for processing (2)
  • openapi.schema.yaml (9 hunks)
  • src/server/run/actix_server/seqvars_clinvar.rs (1 hunks)

Walkthrough

This pull request introduces a comprehensive refactoring of the annotation system, focusing on command-line interface improvements and expanded server-side annotation capabilities. The changes centralize CLI argument handling by creating a new cli.rs module, restructure sequence and structural variant annotation configurations, and add new endpoints for ClinVar and frequency data querying. The modifications enhance the modularity and flexibility of the annotation process across multiple components of the application.

Changes

File Change Summary
src/annotate/cli.rs Added new structs and enums for CLI argument parsing, including Sources, TranscriptSettings, ConsequenceBy, TranscriptPickType, TranscriptPickMode, and TranscriptSource
src/annotate/mod.rs Added public module declaration for cli
src/annotate/seqvars/csq.rs Updated import statements, modified ConsequencePredictor provider visibility, removed TranscriptSource enum
src/annotate/seqvars/mod.rs Restructured Args struct, added annotation methods for FrequencyAnnotator and ClinvarAnnotator
src/server/run/actix_server/mod.rs Added new modules for ClinVar and frequency annotations, updated WebServerData struct
src/server/run/actix_server/seqvars_clinvar.rs Implemented ClinVar annotation endpoint with query and response structures
src/server/run/actix_server/seqvars_frequencies.rs Added frequency annotation endpoint with comprehensive query and result handling
src/server/run/mod.rs Updated Args struct, enhanced annotation source initialization
src/verify/seqvars.rs Centralized imports for ConsequenceBy, TranscriptPickMode, and TranscriptPickType
openapi.schema.yaml Updated API version and added new endpoints and schemas for ClinVar and frequency queries
utils/docker/entrypoint.sh Modified database path handling and added new paths for transcripts and frequencies

Sequence Diagram

sequenceDiagram
    participant CLI as Command Line Interface
    participant Server as Annotation Server
    participant Annotator as Variant Annotator
    participant Database as Annotation Databases

    CLI->>Server: Send Variant Query
    Server->>Annotator: Initialize with Sources
    Annotator->>Database: Retrieve Annotation Data
    Database-->>Annotator: Return Annotation Results
    Annotator-->>Server: Provide Annotation Details
    Server-->>CLI: Return Annotated Variant Information
Loading

Poem

🐰 Annotation Rabbit's Delight

With CLI flags now so bright,
Variants dance in data's light,
Frequencies and ClinVar's might,
Endpoints singing with pure delight,
Our code hops to a new height! 🧬


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@tedil tedil marked this pull request as ready for review January 2, 2025 15:52
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (27)
src/server/run/mod.rs (4)

1-6: Consider consolidating imports.

There are numerous imports from adjacent modules (cli, csq, etc.). If you find yourself refactoring frequently or referencing these across files, consider grouping or re-exporting from a central location for easier management and to reduce repetitive import statements.


153-181: Enhance user guidance & maintain consistency with CLI parameters.

These info logs display helpful endpoint URLs for the user. Given the PR objective is to consolidate CLI options across tools, ensure the displayed guidance accurately reflects any new or updated CLI parameters so as not to confuse users.


206-241: Extract repeated logic into modular functions.

Inside the run function, lines 206–241 deal with transcript database loading, merging, and building different predictors. Similar logic recurs for frequency/ClinVar as well, making the code somewhat lengthy. Extracting such repeated logic (e.g., merging DBs, building annotators) into dedicated helper functions would improve readability and testability.


285-285: Ensure consistent logs and finalize initialization message.

The final logging statement mentions data loading but does not confirm readiness of new endpoints (ClinVar, frequencies, etc.). A clearer statement about which services/annotators are enabled would provide better diagnostics during server startup.

src/annotate/seqvars/csq.rs (4)

2-6: Revisit imports for clarity and redundancy.

You import both ann symbols and TranscriptSource in quick succession. Verify there is no duplication or overshadowing occurring, especially as TranscriptSource was relocated from another file.


Line range hint 1787-1788: Add doc comments for ConsequencePredictor usage.

Lines 1787-1788 reveal the new ConsequenceAnnotator struct. Provide doc comments about how it internally uses ConsequencePredictor and note any side effects or assumptions on the input data, so that future maintainers can quickly grasp its purpose.


Line range hint 1796-1800: Suggest clarifying error messages.

When reading from transcript DB or building the ConsequencePredictor, any failure triggers a generic error. Consider providing more context (e.g., "failed to build predictor with x parameter"), to speed up root cause analysis.


Line range hint 2072-2116: Streamline file path checks for assembly mismatch.

When skipping DB loading if path_component(a) is not in the RosksDB path (line 2079, 2103), consider centralizing these checks in a small utility function. That avoids duplication and helps if new types of DBs are added later.

src/annotate/seqvars/mod.rs (4)

11-12: Align naming conventions.

Lines 11–12 show that use self::ann::{AnnField, Consequence, FeatureBiotype} is placed next to pub use crate::annotate::cli::{Sources, TranscriptSettings}. Consistently naming re-exports or grouping them by domain can improve discoverability and maintainability.


1313-1313: Consider an enum or trait-based approach.

AnnotatorEnum lumps frequency, ClinVar, and Consequence. With further expansion (like coverage or variant quality), an enum might become lengthy. A trait-based approach for different “Annotator” types can keep the code open for extension while ensuring each new type implements consistent methods.


1526-1531: Plan for partial annotation returns.

When calling annotate_variant, you either return Some(...) with the frequency data or an error if none is found. Decide if an empty result is truly an error or if you can return Ok(None) to gracefully handle the absence of data—especially if a partial annotation is acceptable.


2014-2148: Optimize repeated DB loads.

Between lines 2024–2116, various databases are loaded or merged. In typical usage, repeated loading might cost time or memory. Evaluate whether to cache references, load them once at server startup, or share them across modules to reduce overhead.

src/server/run/actix_server/mod.rs (2)

14-16: Align module names & doc comments.

You added pub mod seqvars_clinvar; and pub mod seqvars_frequencies;. A short doc comment describing each module’s responsibility can help new contributors locate the relevant business logic quickly.


56-60: Check memory usage for HashMaps in WebServerData.

frequency_annotators and clinvar_annotators might increase memory usage if multiple large DBs are opened. If usage grows, consider lazy loading or caching only the needed data per request to avoid overhead.

src/annotate/cli.rs (5)

20-25: Consider consistent documentation for each database field.
While transcript and frequency databases are documented, adding explicit availability links or disclaimers for the ClinVar database can further align user expectations.


28-33: Validate default transcript source.
Defaulting transcript_source to Both can lead to higher resource usage or less deterministic outputs. Confirm that a user truly benefits from including all sources by default or consider a domain-specific default (e.g., Ensembl).


53-71: Extend ConsequenceBy or rename the Allele variant.
There may be future expansions like Variant or Haplotype. If planned, clarifying the difference between Allele and a potential Variant variant can preempt confusion.


73-94: Provide usage guidance for each TranscriptPickType.
The enumerators such as ManeSelect, Basic, etc., have domain-specific implications. Including short docs or references for each helps maintainers and users choose correctly.


96-101: Ensure consistent naming for picking strategies.
pick_transcript_mode uses First and All, but some codebases might prefer Single vs. All for clarity. Consistency across the pipeline fosters readability.

src/server/run/actix_server/seqvars_clinvar.rs (2)

53-99: Handle missing or partial annotations gracefully.
annotate_variant may return None if the variant is not found. You’re appending the annotation to the result array only if it exists. Consider returning an empty array or a descriptive placeholder in the response to inform clients about missing data.


112-132: Document endpoint differences in code comments.
You provide both a deprecated endpoint (/seqvars/clinvar) and a new endpoint (/api/v1/seqvars/clinvar). Clarify or cross-reference their differences to help maintainers track usage over time.

src/server/run/actix_server/seqvars_frequencies.rs (2)

33-39: Future-proof the FrequencyResultEntry variants.
Additional variant categories (e.g., plasmids) might be integrated. Consider whether a catch-all variant or forward-compatible design is needed.


160-180: Deprecate old frequency endpoint gracefully.
Similar to clinvar, you have a deprecated /seqvars/frequency and the recommended /api/v1/seqvars/frequency. Add deprecation notices in the doc comments or logs to guide users to the new endpoint.

src/verify/seqvars.rs (1)

10-10: Imports consolidated into cli module.
Switching these enums to be imported from cli is clearer for the user-facing command-line logic. This change is good, but ensure that none of these imports are also needed by the older seqvars module for fallback or backward compatibility.

src/annotate/strucvars/mod.rs (3)

88-88: Hardcoded default overlap threshold.
pub min_overlap: f32 with default 0.8 might be too strict or loose in some downstream analyses. Consider whether a user-supplied range validation or a best-practice note in your docs would help.


92-92: Slack around break-ends.
Using a default of 50 bases makes sense for many use cases but might need adjustments in certain contexts. Possibly document or log a warning if results are unusual with large structural variants.


100-100: Optional date string.
file_date being optional is flexible. Consider pattern validation (e.g., YYYYMMDD) for clarity, or at least a log message if the user sets an unexpected format.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfd396b and 720b934.

📒 Files selected for processing (13)
  • src/annotate/cli.rs (1 hunks)
  • src/annotate/mod.rs (1 hunks)
  • src/annotate/seqvars/csq.rs (4 hunks)
  • src/annotate/seqvars/mod.rs (27 hunks)
  • src/annotate/seqvars/provider.rs (1 hunks)
  • src/annotate/strucvars/mod.rs (2 hunks)
  • src/common/noodles.rs (2 hunks)
  • src/server/run/actix_server/mod.rs (3 hunks)
  • src/server/run/actix_server/seqvars_clinvar.rs (1 hunks)
  • src/server/run/actix_server/seqvars_csq.rs (3 hunks)
  • src/server/run/actix_server/seqvars_frequencies.rs (1 hunks)
  • src/server/run/mod.rs (8 hunks)
  • src/verify/seqvars.rs (1 hunks)
🔇 Additional comments (26)
src/server/run/mod.rs (2)

Line range hint 117-133: Encapsulate input validation.

The Args struct has multiple fields that expect external input (e.g., listen_host, listen_port). Consider validating or sanitizing these fields to avoid potential misconfiguration (for example, verifying if the host is reachable or ensuring the port is in a valid range).


242-247: Proactively handle missing transcript DB scenario.

When no transcripts are provided (else branch at line 242), you log a warning. Consider throwing an explicit error or returning early if transcripts are absolutely required for certain endpoints. This could prevent runtime confusion and alert operators quickly about the missing data.

src/annotate/seqvars/csq.rs (3)

64-64: Provider field visibility changed to pub(crate).

Making provider accessible within the crate can aid in testing or hooking additional code. Validate that external modules do not require direct access; otherwise, consider making a public getter.


Line range hint 152-184: Check boundary conditions for partial decoding.

Methods like normalize_variant, strip common suffixes/prefixes, or overlaps(...) can risk edge-case failures (empty strings, negative positions, etc.). Ensure tests exist for corner cases to maintain robust boundary handling.


Line range hint 2014-2022: Graceful fallback on unknown assemblies.

proto_assembly_from(assembly) returns None if an assembly is not recognized. If the assembly is crucial for annotation, consider either logging a warning or returning an error. Otherwise, you risk silent failures where the annotation might remain incomplete.

src/annotate/seqvars/mod.rs (2)

Line range hint 1405-1455: Revisit error handling here.

In annotate_record_xy, lines 1415–1455 record potential database lookups (X or Y). If the data is missing or partial, consider whether you should fail silently or warn the user. Clarifying this would help debug missing frequency information on sex chromosomes.


1728-1732: Communicate the range of a known classification.

annotate_variant for ClinVar returns classification details. If the classification set is limited (benign, likely benign, etc.), consider enumerating them or providing more context for how to interpret them. That could help consumers of your API handle classification consistently.

src/annotate/mod.rs (1)

7-7: Visibility rationale for pub(crate) mod cli.

Making cli a crate-private module is adequate if external applications do not require direct access to CLI structs. If you anticipate external crates hooking into your CLI, consider making it fully public or providing a public re-export.

src/server/run/actix_server/mod.rs (2)

9-9: Validate concurrency & data races with new annotators.

ClinvarAnnotator and FrequencyAnnotator are introduced here. Confirm they safely support concurrent read operations from the DB (RocksDB) when registered for multiple endpoints in parallel requests. Typically, opening them in read-only mode is fine, but verifying concurrency safety is advised.


78-81: Endpoint expansions well-structured.

Adding .service(seqvars_frequencies::handle) and .service(seqvars_clinvar::handle) cleanly integrates new annotation endpoints. This approach fosters code separation and clarity in the overall Actix application architecture.

src/annotate/cli.rs (3)

4-6: Clarify the usage of #[group(required = true, multiple = true)].
The Clap group annotation forces these options to be treated as a group, requiring at least one occurrence. However, having multiple = true implies multiple group members can be set. Confirm that this design meets user requirements for specifying multiple sources.


37-43: Ensure transcript picking is indeed optional.
pick_transcript is a Vec<TranscriptPickType>. An empty vector implies no picking, yet you must confirm that the logic in downstream code properly handles an empty pick list.


103-123: Confirm serialization expectations for TranscriptSource.
Enums exposed via CLI and JSON often need consistent casing or renaming. Verify if Both should remain capitalized or adapted for user-friendly input/output.

src/server/run/actix_server/seqvars_clinvar.rs (2)

14-31: Validate required fields in ClinvarQuery.
The fields chromosome, position, reference, and alternative are mandatory for queries. Confirm that user input or upstream code always provides them. You might want to enforce them as non-empty strings or valid numeric ranges for position.


33-38: Confirm usage of multiple classification fields.
clinvar_germline_classification is a vector. Does the database commonly return multiple classifications for a single variant? If so, ensure the UI or client code can handle multiple classifications properly.

src/server/run/actix_server/seqvars_frequencies.rs (2)

14-31: Check uppercase/lowercase usage for SPDI parameters.
Some pipelines may pass contig IDs as “chr1” or plain “1.” Clarify whether chromosome must be stripped of a “chr” prefix or can it be upper/lower case. This helps prevent mismatch scenarios.


101-147: Guard against partial frequency data.
When annotate_variant returns partial or no data, an empty vector is returned. Ensure you have appropriate fallback logic (like zero counts) and that no further steps break due to missing keys.

src/common/noodles.rs (2)

63-63: Ensure backward compatibility with added lifetime parameter.
By adding 'a to LocalBoxStream<'a, ...>, references to the stream are now bound to the function context. Check if existing downstream code that consumes these streams needs adjustments for the new lifetime constraints.


79-79: Leverage .record_bufs(header) consistently.
The logic for Vcf calls .record_bufs(header), while Bcf uses .records() and RecordBuf::try_from_variant_record(...). Verify that these calls remain in sync regarding returned fields, especially if extended.

src/server/run/actix_server/seqvars_csq.rs (3)

76-77: Ensure correct usage of strand sign.
The new strand field is an i32, which allows for different sign conventions (e.g., -1 for minus, +1 for plus). Confirm that all places setting or reading this value interpret the strand consistently (e.g., no dependencies on 0 vs. ±1).


151-151: Confirm consistent extraction of strand value.
This code extracts strand from AnnField. Ensure that no panics occur if ann_field has an unexpected strand value, and verify that any default or error behavior is properly handled.


179-179: Include strand in sorting or grouping logic if needed.
As the code now includes strand in each SeqvarsCsqResultEntry, confirm whether sorting or grouping downstream might need to account for the strand sign. If it doesn’t, this is fine to leave as-is.

src/annotate/seqvars/provider.rs (1)

5-5: Consistent referencing of transcript picking types.
Referencing TranscriptPickMode and TranscriptPickType from the cli module keeps all command-line configurations centralized. Verify that no duplicate definitions remain in seqvars to avoid confusion.

src/annotate/strucvars/mod.rs (3)

64-64: Validate presence or content of the input PED file.
pub path_input_ped: String with #[arg(long)] looks straightforward. Consider adding checks to ensure the file exists and is readable before proceeding to annotation.


68-68: Multiple VCF inputs.
pub path_input_vcf: Vec<String> supports multiple inputs. This is useful for merging or analyzing multiple VCF sources. Confirm that you handle possible collisions with sample names and that you address any potential concurrency in reading multiple VCFs.


72-72: Flattened subcommands into PathOutput.
Neatly organizes different output paths under a single struct. This is a good approach. Check if any future expansions (e.g., writing both VCF and TSV at once) require adjusting the grouping constraint.

src/server/run/mod.rs Outdated Show resolved Hide resolved
src/server/run/mod.rs Outdated Show resolved Hide resolved
src/annotate/seqvars/mod.rs Show resolved Hide resolved
@tedil tedil requested a review from holtgrewe January 2, 2025 16:34
@tedil tedil self-assigned this Jan 2, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
openapi.schema.yaml (1)

63-63: Fix incorrect summary in endpoint description.

The summary for the ClinVar endpoint incorrectly mentions "gnomAD frequencies".

-      summary: Query for gnomAD frequencies of a variant.
+      summary: Query for ClinVar information of a variant.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 720b934 and f1c010f.

📒 Files selected for processing (3)
  • openapi.schema.yaml (9 hunks)
  • src/server/run/mod.rs (8 hunks)
  • utils/docker/entrypoint.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
utils/docker/entrypoint.sh

[warning] 36-36: Quote this to prevent word splitting.

(SC2046)


[warning] 37-37: Quote this to prevent word splitting.

(SC2046)


[warning] 38-38: Quote this to prevent word splitting.

(SC2046)


[warning] 39-39: Quote this to prevent word splitting.

(SC2046)


[warning] 40-40: Quote this to prevent word splitting.

(SC2046)


[warning] 41-41: Quote this to prevent word splitting.

(SC2046)

🔇 Additional comments (5)
utils/docker/entrypoint.sh (1)

16-25: LGTM! Well-organized path structure.

The new path organization clearly separates different data types (transcripts, frequencies, ClinVar) and genome builds (GRCh37, GRCh38) under a centralized location.

src/server/run/mod.rs (2)

249-265: Improved handling of multiple frequency databases.

The code now properly handles multiple frequency databases by:

  1. Warning users when no databases are loaded
  2. Using only the first database when multiple are specified
  3. Warning users that multiple databases are not supported

However, consider documenting the selection behavior in user documentation.


267-281: Consistent handling of multiple ClinVar databases.

The implementation mirrors the frequency database handling, providing consistent behavior and clear warnings.

openapi.schema.yaml (2)

59-110: LGTM! Well-designed ClinVar endpoint.

The ClinVar endpoint follows REST best practices with:

  • Consistent parameter naming with other endpoints
  • Clear parameter descriptions
  • Proper error responses

170-221: LGTM! Well-designed frequency endpoint.

The frequency endpoint maintains consistency with:

  • Same parameter structure as ClinVar endpoint
  • Clear separation of autosomal/gonosomal/mitochondrial results
  • Proper error handling

Comment on lines +36 to +41
$(test -e "$PATH_TRANSCRIPTS_37" && echo --transcripts "$PATH_TRANSCRIPTS_37") \
$(test -e "$PATH_TRANSCRIPTS_38" && echo --transcripts "$PATH_TRANSCRIPTS_38") \
$(test -e "$PATH_FREQUENCIES_37" && echo --frequencies "$PATH_FREQUENCIES_37") \
$(test -e "$PATH_FREQUENCIES_38" && echo --frequencies "$PATH_FREQUENCIES_38") \
$(test -e "$PATH_CLINVAR_37" && echo --clinvar "$PATH_CLINVAR_37") \
$(test -e "$PATH_CLINVAR_38" && echo --clinvar "$PATH_CLINVAR_38") \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix potential word splitting issues in command substitutions.

The current command substitutions could be vulnerable to word splitting if paths contain spaces.

Apply this diff to properly quote the command substitutions:

-      $(test -e "$PATH_TRANSCRIPTS_37" && echo --transcripts "$PATH_TRANSCRIPTS_37") \
-      $(test -e "$PATH_TRANSCRIPTS_38" && echo --transcripts "$PATH_TRANSCRIPTS_38") \
-      $(test -e "$PATH_FREQUENCIES_37" && echo --frequencies "$PATH_FREQUENCIES_37") \
-      $(test -e "$PATH_FREQUENCIES_38" && echo --frequencies "$PATH_FREQUENCIES_38") \
-      $(test -e "$PATH_CLINVAR_37" && echo --clinvar "$PATH_CLINVAR_37") \
-      $(test -e "$PATH_CLINVAR_38" && echo --clinvar "$PATH_CLINVAR_38") \
+      "$(test -e "$PATH_TRANSCRIPTS_37" && echo --transcripts "$PATH_TRANSCRIPTS_37")" \
+      "$(test -e "$PATH_TRANSCRIPTS_38" && echo --transcripts "$PATH_TRANSCRIPTS_38")" \
+      "$(test -e "$PATH_FREQUENCIES_37" && echo --frequencies "$PATH_FREQUENCIES_37")" \
+      "$(test -e "$PATH_FREQUENCIES_38" && echo --frequencies "$PATH_FREQUENCIES_38")" \
+      "$(test -e "$PATH_CLINVAR_37" && echo --clinvar "$PATH_CLINVAR_37")" \
+      "$(test -e "$PATH_CLINVAR_38" && echo --clinvar "$PATH_CLINVAR_38")" \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
$(test -e "$PATH_TRANSCRIPTS_37" && echo --transcripts "$PATH_TRANSCRIPTS_37") \
$(test -e "$PATH_TRANSCRIPTS_38" && echo --transcripts "$PATH_TRANSCRIPTS_38") \
$(test -e "$PATH_FREQUENCIES_37" && echo --frequencies "$PATH_FREQUENCIES_37") \
$(test -e "$PATH_FREQUENCIES_38" && echo --frequencies "$PATH_FREQUENCIES_38") \
$(test -e "$PATH_CLINVAR_37" && echo --clinvar "$PATH_CLINVAR_37") \
$(test -e "$PATH_CLINVAR_38" && echo --clinvar "$PATH_CLINVAR_38") \
"$(test -e "$PATH_TRANSCRIPTS_37" && echo --transcripts "$PATH_TRANSCRIPTS_37")" \
"$(test -e "$PATH_TRANSCRIPTS_38" && echo --transcripts "$PATH_TRANSCRIPTS_38")" \
"$(test -e "$PATH_FREQUENCIES_37" && echo --frequencies "$PATH_FREQUENCIES_37")" \
"$(test -e "$PATH_FREQUENCIES_38" && echo --frequencies "$PATH_FREQUENCIES_38")" \
"$(test -e "$PATH_CLINVAR_37" && echo --clinvar "$PATH_CLINVAR_37")" \
"$(test -e "$PATH_CLINVAR_38" && echo --clinvar "$PATH_CLINVAR_38")" \
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 36-36: Quote this to prevent word splitting.

(SC2046)


[warning] 37-37: Quote this to prevent word splitting.

(SC2046)


[warning] 38-38: Quote this to prevent word splitting.

(SC2046)


[warning] 39-39: Quote this to prevent word splitting.

(SC2046)


[warning] 40-40: Quote this to prevent word splitting.

(SC2046)


[warning] 41-41: Quote this to prevent word splitting.

(SC2046)

@tedil
Copy link
Contributor Author

tedil commented Jan 3, 2025

What is the current layout of the database folder (--path-db)? What should it ideally be?

One issue, for example, is that we do not provide a combined refseq+ensembl transcript db anymore in the mehari-data-tx releases. That means you either have to specify both the refseq and ensembl version of the txdb, or merge the DBs beforehand and supply that instead (or we provide the pre-merged versions in the mehari-data-tx workflow, too). Either option is fine for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant