feat: Report FP / FN negative variants in single table #107
Conversation
Walkthrough

The changes introduce enhancements to a Snakemake workflow for evaluating variant calling performance. Key modifications include the addition of new rules and functions to manage false positive and false negative data, improved handling of coverage data, and the introduction of new scripts for processing benchmark and stratification data. A new YAML configuration file is also created to structure data visualization and reporting. These updates refine the evaluation process and improve data organization within the workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Snakemake
    participant CoverageData
    participant BenchmarkData
    participant Report
    User->>Snakemake: Trigger workflow
    Snakemake->>CoverageData: Retrieve coverage information
    Snakemake->>BenchmarkData: Collect FP/FN benchmarks
    Snakemake->>Report: Generate reports
    Report-->>User: Provide evaluation results
```
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (13)
workflow/scripts/collect-fp-fn-benchmarks.py (3)

1-5: Use a context manager for file handling.

Consider using a context manager (`with` statement) for safer file handling.

```diff
-sys.stderr = open(snakemake.log[0], "w")
+with open(snakemake.log[0], "w") as log_file:
+    sys.stderr = log_file
```

🧰 Tools
🪛 Ruff (0.8.0)
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
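One caveat with the suggested `with`-based redirection: the log file is closed as soon as the block ends, so any later writes to `sys.stderr` would fail. A minimal sketch of a scoped alternative using `contextlib.redirect_stderr`, which also restores `sys.stderr` afterwards (the temp-file log path is a hypothetical stand-in for `snakemake.log[0]`):

```python
import contextlib
import sys
import tempfile

# Hypothetical stand-in for snakemake.log[0].
log_path = tempfile.NamedTemporaryFile(suffix=".log", delete=False).name

# redirect_stderr scopes the redirection: everything written to stderr
# inside the block lands in the log file, which is then closed cleanly,
# and sys.stderr is restored on exit.
with open(log_path, "w") as log_file, contextlib.redirect_stderr(log_file):
    print("processing started", file=sys.stderr)

with open(log_path) as fh:
    contents = fh.read()

print(contents.strip())  # → processing started
```

The trade-off is indentation: the whole script body has to live inside the block, which is why some Snakemake scripts keep the bare assignment instead.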
7-10: Add error handling and type hints.

The function would benefit from:
- Error handling for file reading
- Type hints for better code maintainability
- A docstring explaining the function's purpose

```diff
-def load_data(path, callset):
-    d = pd.read_csv(path, sep="\t")
-    d.insert(0, "callset", callset)
-    return d
+def load_data(path: str, callset: str) -> pd.DataFrame:
+    """Load benchmark data from a TSV file and add a callset identifier.
+
+    Args:
+        path: Path to the TSV file
+        callset: Identifier for the data source
+
+    Returns:
+        DataFrame with loaded data and callset column
+    """
+    try:
+        d = pd.read_csv(path, sep="\t")
+        d.insert(0, "callset", callset)
+        return d
+    except Exception as e:
+        raise RuntimeError(f"Failed to load data from {path}: {e}")
```
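As a sanity check, the suggested pattern can be exercised against an in-memory TSV; the `StringIO` input and the column names below are illustrative, not taken from the workflow's real tables:

```python
from io import StringIO

import pandas as pd


def load_data(path, callset: str) -> pd.DataFrame:
    """Read a TSV and tag every row with its callset (the suggested pattern)."""
    try:
        d = pd.read_csv(path, sep="\t")
    except Exception as e:
        raise RuntimeError(f"Failed to load data from {path}: {e}")
    d.insert(0, "callset", callset)
    return d


# Illustrative in-memory TSV; real inputs are per-callset benchmark tables.
tsv = StringIO("chrom\tpos\nchr1\t100\nchr2\t200\n")
df = load_data(tsv, "callset-a")
print(list(df.columns))  # → ['callset', 'chrom', 'pos']
```

`insert(0, ...)` keeps the callset identifier as the leftmost column, which makes the later `pd.concat` across callsets easy to inspect.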
1-42: Consider adding logging and memory optimization.

For better operational visibility and scalability:
- Add logging of key operations (data loading, processing steps)
- Consider using the `chunksize` parameter of `pd.read_csv` for large files
- Use `dtype` specifications in pandas operations to optimize memory usage

Example logging implementation:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.FileHandler(snakemake.log[0]), logging.StreamHandler()],
)

# Then use throughout the code:
logging.info(f"Loading data from {len(snakemake.input.tables)} input files")
logging.info(f"Processed {len(results)} rows of data")
```

🧰 Tools
🪛 Ruff (0.8.0)
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
16-16: Undefined name `snakemake` (F821)
16-16: Undefined name `snakemake` (F821)
42-42: Undefined name `snakemake` (F821)
workflow/resources/datavzrd/fp-fn-per-benchmark-config.yte.yaml (3)

4-5: Consider extracting color codes to a theme configuration.

The hardcoded color values could be moved to a common theme configuration file to ensure consistency across the visualization configs and make it easier to maintain a unified color scheme.

27-30: Complete or remove the commented description block.

The commented description block appears to be incomplete. Either complete it with relevant benchmark information or remove it if not needed.

23-44: Add a newline at the end of the file.

The view configuration looks good, but please add a newline at the end of the file to comply with YAML standards.

```diff
       predicted_genotype:
         display-mode: hidden
+
```

🧰 Tools
🪛 yamllint (1.35.1)
[error] 44-44: no new line character at the end of file (new-line-at-end-of-file)
workflow/scripts/collect-stratifications-fp-fn.py (3)

8-20: Add input validation for the coverage parameter.

The function assumes the coverage value exists in `snakemake.params.coverage_lower_bounds`. Consider adding validation to handle missing or invalid values gracefully.

```diff
 def get_cov_label(coverage):
+    if coverage not in snakemake.params.coverage_lower_bounds:
+        raise ValueError(f"Invalid coverage value: {coverage}")
     lower = snakemake.params.coverage_lower_bounds[coverage]
```

🧰 Tools
🪛 Ruff (0.8.0)
9-9: Undefined name `snakemake` (F821)
12-12: Undefined name `snakemake` (F821)
33-39: Consider implementing the commented validation logic.

The TODO comment suggests important validation for variant prediction, which could help catch technical issues early in the pipeline. Would you like me to help implement this validation logic or create a GitHub issue to track this enhancement?
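As an illustration only (the actual TODO and its column names are not visible in the diff), such a validation pass might flag rows whose FP/FN label contradicts the recorded genotypes. Every column name below is a hypothetical placeholder:

```python
import pandas as pd


def find_inconsistent_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows whose FP/FN label contradicts the recorded genotypes.

    All column names here ("class", "predicted_genotype", "true_genotype")
    are hypothetical placeholders for whatever the real tables use.
    """
    # An FP must have a predicted call; an FN must have a truth entry.
    fp_without_call = (df["class"] == "fp") & df["predicted_genotype"].isna()
    fn_without_truth = (df["class"] == "fn") & df["true_genotype"].isna()
    return df[fp_without_call | fn_without_truth]


df = pd.DataFrame(
    {
        "class": ["fp", "fn", "fp"],
        "predicted_genotype": ["0/1", None, None],
        "true_genotype": [None, "1/1", None],
    }
)
bad = find_inconsistent_rows(df)
print(len(bad))  # → 1 (the last row: labeled fp but no predicted genotype)
```

Raising or logging on a non-empty result would surface malformed records before they reach the report stage.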
1-56: Consider improving modularity for testing.

The script's main logic could be encapsulated in a `main()` function to facilitate unit testing and improve maintainability. This would also help separate the Snakemake-specific code from the core functionality.

Consider restructuring the script like this:

```python
def main(input_files, output_file, coverages, coverage_lower_bounds):
    if input_files:
        report = pd.concat(
            load_data(f, cov) for cov, f in zip(coverages, input_files)
        )
        report.to_csv(output_file, sep="\t", index=False)
    else:
        # Create empty DataFrame...
        ...


if __name__ == "__main__":
    main(
        snakemake.input,
        snakemake.output[0],
        snakemake.params.coverages,
        snakemake.params.coverage_lower_bounds,
    )
```

🧰 Tools
🪛 Ruff (0.8.0)
3-3: Use a context manager for opening files (SIM115)
3-3: Undefined name `snakemake` (F821)
9-9: Undefined name `snakemake` (F821)
12-12: Undefined name `snakemake` (F821)
28-28: Undefined name `snakemake` (F821)
30-30: Undefined name `snakemake` (F821)
30-30: Undefined name `snakemake` (F821)
41-41: Undefined name `snakemake` (F821)
56-56: Undefined name `snakemake` (F821)
workflow/rules/common.smk (3)

406-413: Consider refactoring to reduce code duplication with `get_coverages`.

The function implements similar logic to `get_coverages`. Consider extracting the common logic into a shared helper function to improve maintainability.

```diff
-def get_coverages_of_callset(callset):
-    benchmark = config["variant-calls"][callset]["benchmark"]
-    high_cov_status = benchmarks[benchmark].get("high-coverage", False)
-    if high_cov_status:
-        coverages = high_coverages
-    else:
-        coverages = low_coverages
-    return coverages
+def _get_coverage_dict(high_cov_status):
+    return high_coverages if high_cov_status else low_coverages
+
+
+def get_coverages_of_callset(callset):
+    benchmark = config["variant-calls"][callset]["benchmark"]
+    high_cov_status = benchmarks[benchmark].get("high-coverage", False)
+    return _get_coverage_dict(high_cov_status)
+
+
+def get_coverages(wildcards):
+    if hasattr(wildcards, "benchmark"):
+        high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage", False)
+    else:
+        benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
+        high_cov_status = benchmarks[benchmark].get("high-coverage", False)
+    return _get_coverage_dict(high_cov_status)
```
502-509: Consider simplifying the benchmark set comprehension and improving the function name.

The function works correctly but could be improved:
- The set comprehension is redundant since `used_benchmarks` is already a set
- The function name could be more descriptive, e.g., `get_benchmark_fp_fn_report_paths`

```diff
-def get_fp_fn_reports_benchmarks(wildcards):
+def get_benchmark_fp_fn_report_paths(wildcards):
     for genome in used_genomes:
         yield from expand(
             "results/report/fp-fn/benchmarks/{benchmark}/{classification}",
-            benchmark={benchmark for benchmark in used_benchmarks},
+            benchmark=used_benchmarks,
             classification=["fp", "fn"],
         )
```
601-605: Consider reusing filtered callsets to avoid code duplication.

The coverage filtering logic is duplicated from `get_collect_fp_fn_callsets`. Consider reusing that function to maintain the DRY principle.

```diff
 def get_collect_fp_fn_labels(wildcards):
-    callsets = get_genome_callsets(wildcards.genome)
-    callsets = [
-        callset
-        for callset in callsets
-        if wildcards.cov in get_coverages_of_callset(callset)
-    ]
+    callsets = get_collect_fp_fn_callsets(wildcards)
     return get_callset_label_entries(callsets)
```

workflow/rules/eval.smk (1)
377-393: Consider explicitly specifying dependencies instead of using priority.

In the `collect_stratifications_fp_fn` rule, `priority: 1` is set to ensure it runs after the precision/recall computations. It is advisable to model such ordering explicitly through rule inputs instead, so that Snakemake's DAG enforces the order. This approach improves workflow clarity and maintainability.
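A hedged sketch of the explicit-dependency alternative (the rule and input-function names follow the PR; the extra file path is purely illustrative and would need to match the real precision/recall output):

```snakemake
rule collect_stratifications_fp_fn:
    input:
        tables=get_collect_stratifications_fp_fn_input,
        # Illustrative extra input: consuming the precision/recall result
        # makes the ordering part of the DAG instead of a priority hint.
        precision_recall="results/precision-recall/callsets/{callset}.tsv",
    output:
        "results/fp-fn/callsets/{callset}/{classification}.tsv",
    ...
```

Unlike `priority`, which only biases scheduling among ready jobs, an input dependency guarantees the ordering regardless of the number of available cores.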
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (6)
- workflow/Snakefile (1 hunks)
- workflow/resources/datavzrd/fp-fn-per-benchmark-config.yte.yaml (1 hunks)
- workflow/rules/common.smk (6 hunks)
- workflow/rules/eval.smk (3 hunks)
- workflow/scripts/collect-fp-fn-benchmarks.py (1 hunks)
- workflow/scripts/collect-stratifications-fp-fn.py (1 hunks)
🧰 Additional context used

🪛 yamllint (1.35.1)
workflow/resources/datavzrd/fp-fn-per-benchmark-config.yte.yaml
[error] 44-44: no new line character at the end of file (new-line-at-end-of-file)

🪛 Ruff (0.8.0)
workflow/scripts/collect-fp-fn-benchmarks.py
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
16-16: Undefined name `snakemake` (F821)
16-16: Undefined name `snakemake` (F821)
42-42: Undefined name `snakemake` (F821)

workflow/scripts/collect-stratifications-fp-fn.py
3-3: Use a context manager for opening files (SIM115)
3-3: Undefined name `snakemake` (F821)
9-9: Undefined name `snakemake` (F821)
12-12: Undefined name `snakemake` (F821)
28-28: Undefined name `snakemake` (F821)
30-30: Undefined name `snakemake` (F821)
30-30: Undefined name `snakemake` (F821)
41-41: Undefined name `snakemake` (F821)
56-56: Undefined name `snakemake` (F821)
🔇 Additional comments (11)

workflow/resources/datavzrd/fp-fn-per-benchmark-config.yte.yaml (2)

10-11: LGTM! The view settings are appropriate for a data visualization interface.

18-21: Verify the table linking configuration.

The table linking configuration needs verification to ensure it correctly connects the callset data between tables.

✅ Verification successful

The table linking configuration is correctly set up. The link configuration in fp-fn-per-benchmark-config.yte.yaml correctly references:
- The source column "callset" in the current table
- The target table-row "results-table/callset", which exists in both configuration files

The linking structure is consistent with the datavzrd configuration format and matches between the files.

🏁 Scripts executed

```bash
#!/bin/bash
# Description: Check for the existence of the referenced table and column
# Search for other datavzrd configs that might define the target table
fd -e yaml -e yml . workflow/resources/datavzrd -x grep -l "results-table/callset"
```

Length of output: 207

```bash
#!/bin/bash
# Examine both files to understand the table structure and linking
cat workflow/resources/datavzrd/precision-recall-config.yte.yaml workflow/resources/datavzrd/fp-fn-per-benchmark-config.yte.yaml
```

Length of output: 8596
workflow/Snakefile (1)

26-26: Verify the implementation of `get_fp_fn_reports_benchmarks`.

The addition of this input variable aligns well with the PR's objective to collect FP/FN results across benchmarks. However, let's verify its implementation.

✅ Verification successful

The function implementation is correct and properly used. The `get_fp_fn_reports_benchmarks` function is correctly implemented in workflow/rules/common.smk and yields paths for FP/FN reports across all benchmarks and genomes, following the expected path pattern results/report/fp-fn/benchmarks/{benchmark}/{classification}. The function is only used in the Snakefile as intended, making this a valid and well-structured change.

🏁 Scripts executed

```bash
#!/bin/bash
# Description: Verify the implementation of the get_fp_fn_reports_benchmarks function
# Expected: Function definition in common.smk or related files

# Search for the function definition
echo "Searching for function definition:"
rg -A 5 "def get_fp_fn_reports_benchmarks"

# Search for any usage of the function
echo -e "\nSearching for function usage:"
rg "get_fp_fn_reports_benchmarks"
```

Length of output: 809
workflow/rules/common.smk (4)

481-485: LGTM! Implementation follows established patterns. The function correctly generates input paths for FP/FN stratification results across all coverage levels, maintaining consistency with similar functions in the codebase.

526-531: LGTM! Well-structured implementation. The function correctly collects FP/FN inputs for benchmarks, following the established pattern of similar collection functions.

582-588: LGTM! Coverage filtering is correctly implemented. The function effectively filters callsets based on coverage levels, ensuring only relevant callsets are included.

593-596: LGTM! Path structure aligns with PR objectives. The function correctly implements the new path structure for FP/FN results while maintaining consistency with the existing pattern.
workflow/rules/eval.smk (4)

341-343: Update output and log paths for consistency with the new folder structure. The updated output and log paths in the `extract_fp_fn` rule reflect the new folder structure for FP/FN results, enhancing organization by grouping files under their respective callsets.

396-409: New rule `collect_fp_fn_benchmark` added. The rule effectively collects FP/FN benchmarks across all callsets within a benchmark. The inputs, outputs, parameters, and script are correctly specified.

418-425: Adjust the output directory and labels in the `report_fp_fn` rule. The changes to the output directory path and labels enhance clarity and maintain consistency in the reporting structure by organizing reports under the genomes directory.

436-460: New rule `report_fp_fn_benchmark` added. The rule appropriately generates reports for FP/FN benchmarks per benchmark. The inputs, outputs, parameters, and wrapper are correctly defined, integrating well with the existing reporting workflow.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)

workflow/resources/datavzrd/fp-fn-per-callset-config.yte.yaml (1)

1-37: Add a newline at the end of the file. The configuration looks well-structured and follows YAML best practices. However, please add a newline at the end of the file to comply with standard file formatting practices.

🧰 Tools
🪛 yamllint (1.35.1)
[error] 37-37: no new line character at the end of file (new-line-at-end-of-file)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (4)
- workflow/Snakefile (1 hunks)
- workflow/resources/datavzrd/fp-fn-per-callset-config.yte.yaml (1 hunks)
- workflow/rules/common.smk (7 hunks)
- workflow/rules/eval.smk (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/Snakefile

🧰 Additional context used
🪛 yamllint (1.35.1)
workflow/resources/datavzrd/fp-fn-per-callset-config.yte.yaml
[error] 37-37: no new line character at the end of file (new-line-at-end-of-file)
🔇 Additional comments (7)

workflow/rules/eval.smk (4)

341-343: LGTM: Improved file path structure. The modified output paths now include coverage in the filename, which improves organization and makes it easier to track results across different coverage levels.

377-394: LGTM: Well-structured collection rule with memory considerations. The new `collect_stratifications_fp_fn` rule is well implemented, with:
- An appropriate priority setting to manage memory usage
- Clear parameter definitions for coverages
- Proper error handling through the conda environment

396-409: LGTM: Benchmark collection implementation. The `collect_fp_fn_benchmark` rule effectively aggregates benchmark-specific FP/FN data with appropriate input handling and conda environment setup.

436-462: Verify report configuration parameters. The `report_fp_fn_callset` rule looks good, but let's verify the configuration parameters are correctly passed to datavzrd.

✅ Verification successful

Configuration parameters are correctly passed and consistent. The datavzrd configuration files show that fp-fn-per-callset-config.yte.yaml correctly handles the required parameters:
- Uses `params.genome` and `params.version` for benchmark versioning
- Properly handles the `params.somatic` flag to conditionally hide genotype columns
- Correctly configures coverage visualization with an ordinal scale
- Table settings (separator, Excel export) match other configurations

🏁 Scripts executed

```bash
#!/bin/bash
# Description: Verify datavzrd configuration parameters

# Check if the config file exists and validate its structure
cat workflow/resources/datavzrd/fp-fn-per-callset-config.yte.yaml

# Check other report configurations for consistency
fd -e yaml -e yml . workflow/resources/datavzrd/ -x cat {}
```

Length of output: 12741
workflow/rules/common.smk (3)

408-416: LGTM: Well-implemented coverage determination. The `get_coverages_of_callset` function effectively determines the appropriate coverage settings based on the benchmark configuration.

483-488: LGTM: Comprehensive input collection functions. The new input collection functions are well-structured and maintain consistency with the existing codebase:
- `get_collect_stratifications_fp_fn_input`: handles stratification data collection
- `get_fp_fn_reports_benchmarks`: manages benchmark-specific reporting
- `get_collect_fp_fn_benchmark_input`: handles benchmark input collection

Also applies to: 504-511, 528-533

584-590: LGTM: Improved callset filtering. The modifications to `get_collect_fp_fn_callsets` and `get_collect_fp_fn_labels` now properly filter callsets based on coverage settings, ensuring consistent data handling.

Also applies to: 603-607
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)

workflow/rules/common.smk (2)

408-416: Consider reusing the `get_coverages` logic. The function duplicates logic from `get_coverages`. Consider refactoring to reuse the existing function:

```diff
 def get_coverages_of_callset(callset):
-    benchmark = config["variant-calls"][callset]["benchmark"]
-    high_cov_status = benchmarks[benchmark].get("high-coverage", False)
-    if high_cov_status:
-        coverages = high_coverages
-    else:
-        coverages = low_coverages
-    return coverages
+    class CallsetWildcards:
+        def __init__(self, callset):
+            self.callset = callset
+
+    return get_coverages(CallsetWildcards(callset))
```

504-511: Simplify benchmark set creation. The set comprehension is redundant since `used_benchmarks` is already a set.

```diff
 def get_fp_fn_reports_benchmarks(wildcards):
     for genome in used_genomes:
         yield from expand(
             "results/report/fp-fn/benchmarks/{benchmark}/{classification}",
-            benchmark={benchmark for benchmark in used_benchmarks},
+            benchmark=used_benchmarks,
             classification=["fp", "fn"],
         )
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
- workflow/rules/common.smk (6 hunks)
🔇 Additional comments (6)

workflow/rules/common.smk (6)

22-23: LGTM: Clean implementation of `used_callsets`. The set comprehension is an efficient way to collect all callset keys.

483-488: LGTM: Well-structured input path generation. The function follows the established pattern and correctly handles empty coverage filtering.

528-533: LGTM: Clean implementation of benchmark-based FP/FN collection. The function correctly uses `get_benchmark_callsets` and follows the established pattern for input path generation.

584-590: LGTM: Effective callset filtering based on coverage. The function correctly filters callsets based on coverage compatibility using the new `get_coverages_of_callset` function.

595-598: LGTM: Well-structured input path generation. The function correctly uses filtered callsets and follows the established pattern for input path generation.

602-603: LGTM: Consistent label handling. The function correctly uses filtered callsets and reuses `get_callset_label_entries` for consistent label handling.
Change folder structure for fp-fn results, add collection of fp-fn results for all covs of a callset and all callsets of a benchmark.
Results for each benchmark are currently sorted by callset and then coverage. There is no dependency-based sorting of the tables yet.
The new tables are not yet used in the datavzrd report and are not requested by any rules.