diff --git a/_quarto.yml b/_quarto.yml index 06ad85d..d0061a5 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -21,6 +21,8 @@ book: - part: chapters/representations/index.qmd chapters: - chapters/representations/biocframe.qmd + - chapters/representations/iranges.qmd + - chapters/representations/genomicranges.qmd - chapters/summary.qmd - chapters/references.qmd diff --git a/chapters/representations/biocframe.qmd b/chapters/representations/biocframe.qmd index 886324f..3e917d6 100644 --- a/chapters/representations/biocframe.qmd +++ b/chapters/representations/biocframe.qmd @@ -1,12 +1,12 @@ # `BiocFrame` - Bioconductor-like data frames -This package implements the BiocFrame class, a Bioconductor-friendly alternative to Pandas DataFrame. The main advantage is that the BiocFrame makes no assumption on the types of the columns - as long as an object has a length (`__len__`) and slicing methods (`__getitem__`), it can be used inside a `BiocFrame`. +`BiocFrame` class is a Bioconductor-friendly alternative to Pandas `DataFrame`. Its key advantage lies in not making assumptions on the types of the columns - as long as an object has a length (`__len__`) and supports slicing methods (`__getitem__`), it can be used inside a `BiocFrame`. -This allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects. +This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects. ## Installation -Package is published to [PyPI](https://pypi.org/project/biocframe/) +To get started, install the package from [PyPI](https://pypi.org/project/biocframe/) ```bash pip install biocframe @@ -14,7 +14,7 @@ pip install biocframe ## Construction -To construct a `BiocFrame` object, simply provide the data as a dictionary. +To create a `BiocFrame` object, simply provide the data as a dictionary. 
```{python} from biocframe import BiocFrame @@ -29,7 +29,7 @@ print(bframe) ::: {.callout-tip} You can specify complex objects as columns, as long as they have some "length" equal to the number of rows. -For example, we can nest a `BiocFrame` inside another `BiocFrame`: +For example, we can embed a `BiocFrame` within another `BiocFrame`: ::: @@ -50,22 +50,18 @@ print(bframe2) ## Extracting data -Properties can be accessed directly from the object: +Properties can be directly accessed from the object: ```{python} print("shape:", bframe.shape) - print("column names (functional style):", bframe.get_column_names()) - print("column names (as property):", bframe.column_names) # same as above ``` We can fetch individual columns: ```{python} - print("functional style:", bframe.get_column("ensembl")) - print("w/ accessor", bframe["ensembl"]) ``` @@ -75,9 +71,9 @@ And we can get individual rows as a dictionary: bframe.get_row(2) ``` -::: {.callout-important} -To extract a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator. -This accepts different subsetting arguments like a boolean vector, a `slice` object, a sequence of indices, or row/column names. +::: {.callout} +To retrieve a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator. +This operator accepts different subsetting arguments, such as a boolean vector, a `slice` object, a sequence of indices, or row/column names. ::: ```{python} @@ -95,8 +91,7 @@ print("\nShort-hand to get a single column: \n", bframe["ensembl"]) ### Preferred approach -To set `BiocFrame` properties, we encourage a **functional style** of programming that avoids mutating the object. -This avoids inadvertent modification of `BiocFrame`s that are part of larger data structures. +To set `BiocFrame` properties, we encourage a **functional style** of programming that avoids mutating the object. This avoids inadvertent modification of `BiocFrame` instances within larger data structures. 
```{python} modified = bframe.set_column_names(["column1", "column2"]) @@ -122,10 +117,6 @@ print(modified) Change the row or column names: -::: {.callout-note} -The functional style allows you to chain multiple operations as in the example below. -::: - ```{python} modified = bframe.\ set_column_names(["FOO", "BAR"]).\ @@ -133,6 +124,11 @@ modified = bframe.\ print(modified) ``` + +::: {.callout-tip} +The functional style allows you to chain multiple operations. +::: + We also support Bioconductor's metadata concepts, either along the columns or for the entire object: ```{python} @@ -144,8 +140,7 @@ print(modified) ### The other way -Properties can also be set by direct assignment for in-place modification. -We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures. +Properties can also be set by direct assignment for in-place modification. We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures. Nonetheless: ```{python} @@ -154,7 +149,7 @@ testframe.column_names = ["column1", "column2" ] print(testframe) ``` -::: {.callout-caution} +::: {.callout-important} Warnings are raised when properties are directly mutated. These assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`. It is best to do this only if the `BiocFrame` object is not being used anywhere else; otherwise, it is safer to just create a (shallow) copy via the default `in_place = False`. @@ -169,8 +164,7 @@ testframe[1:3, ["column1","column2"]] = BiocFrame({"x":[4, 5], "y":["E", "F"]}) ## Combining objects -**BiocFrame** implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils). -So, for example, to combine by row: +`BiocFrame` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils). 
For example, to combine by row: ```{python} import biocutils @@ -216,7 +210,7 @@ combined = biocutils.relaxed_combine_rows(bframe1, modified2) print(combined) ``` -Similarly, if the rows are different, we can use **BiocFrame**'s `merge` function: +Similarly, if the rows are different, we can use `BiocFrame`'s `merge` function: ```{python} from biocframe import merge @@ -230,9 +224,7 @@ print(combined) ## Interop with pandas -`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R. -Most users will probably prefer to work with **pandas** `DataFrame` objects for their actual analyses. -This conversion is easily achieved: +`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R. Most users will probably prefer to work with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved: ```{python} from biocframe import BiocFrame @@ -256,14 +248,14 @@ print(out) ## Empty Frames -We can create empty `BiocFrame` objects that hold no information except the number of rows. This is useful when `BiocFrame` objects are part of larger datastructures but hold no data. +We can create empty `BiocFrame` objects that only specify the number of rows. This proves beneficial in situations where `BiocFrame` objects are integrated into more extensive data structures but do not possess any data themselves. ```{python} empty = BiocFrame(number_of_rows=100) print(empty) ``` -Most operations described in the document can also be performed on an empty `BiocFrame` object. +Most operations described in this document can be performed on an empty `BiocFrame` object. ```{python} print("Column names:", empty.column_names) @@ -275,6 +267,7 @@ print("\nSubsetting an empty BiocFrame: \n", subset_empty) ::: {.callout-tip} Similarly one can create an empty `BiocFrame` with only row names. 
::: + ## Further reading Check out [the reference documentation](https://biocpy.github.io/BiocFrame/) for more details. diff --git a/chapters/representations/genomicpacks.qmd b/chapters/representations/genomicpacks.qmd new file mode 100644 index 0000000..a55bb7d --- /dev/null +++ b/chapters/representations/genomicpacks.qmd @@ -0,0 +1 @@ +## Genomic analysis \ No newline at end of file diff --git a/chapters/representations/genomicranges.qmd b/chapters/representations/genomicranges.qmd new file mode 100644 index 0000000..d6f6037 --- /dev/null +++ b/chapters/representations/genomicranges.qmd @@ -0,0 +1,560 @@ +# `GenomicRanges`: Genomic analysis + +`GenomicRanges` is a Python package designed to handle genomic locations and facilitate genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html). **_Remember, intervals are inclusive on both ends and start at 1._** + +::: {.callout-note} +The class implementation aligns closely with Bioconductor's [R/GenomicRanges package](https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf). 
+::: + +## Construct a `GenomicRanges` object + +To construct a `GenomicRanges` object from interval ranges (**Preferred way**): + +```{python} +from genomicranges import GenomicRanges +from iranges import IRanges +from biocframe import BiocFrame +from random import random + +gr = GenomicRanges( + seqnames=[ + "chr1", + "chr2", + "chr3", + "chr2", + "chr3", + ], + ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]), + strand=["*", "-", "*", "+", "-"], + mcols=BiocFrame( + { + "score": range(0, 5), + "GC": [random() for _ in range(5)], + } + ), +) + +print(gr) +``` + +### From UCSC or GTF file + +You can also import genomes from UCSC or load a genome annotation from a GTF file: + +```{python} +import genomicranges + +# gr = genomicranges.read_gtf() + +# OR + +# gr = genomicranges.read_ucsc(genome="hg19") +# print(gr) +``` + +### Pandas DataFrame + +If your genomic coordinates are represented as a pandas `DataFrame`, you can convert it into a `GenomicRanges` object, provided it contains the necessary columns. + +::: {.callout-note} +The `DataFrame` must contain columns `seqnames`, `starts` and `ends` to represent genomic coordinates. The rest of the columns are considered metadata and will be available in the `mcols` slot of the `GenomicRanges` object. +::: + +```{python} +from genomicranges import GenomicRanges +import pandas as pd + +df = pd.DataFrame( + { + "seqnames": ["chr1", "chr2", "chr1", "chr3", "chr2"], + "starts": [101, 102, 103, 104, 109], + "ends": [112, 103, 128, 134, 111], + "strand": ["*", "-", "*", "+", "-"], + "score": range(0, 5), + "GC": [random() for _ in range(5)], + } +) + +gr = GenomicRanges.from_pandas(df) +print(gr) +``` + +### Set sequence information + +The package provides a `SeqInfo` class to update or modify sequence information stored in the object. Learn more about this in the [GenomeInfoDb package](https://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html). 
+ +```{python} +from genomicranges import SeqInfo + +seq_obj = { + "seqnames": ["chr1", "chr2", "chr3",], + "seqlengths": range(100, 103), + "is_circular": [random() < 0.5 for _ in range(3)], + "genome": "hg19", +} + +seq = SeqInfo(seq_obj) +gr.seq_info = seq +print(gr) +``` + +## Getters/Setters + +Getters are available to access various properties. + +```{python} +# access sequence names +gr.seqnames + +# access all start positions +gr.start + +# access annotation information if available +gr.seq_info + +# compute and return the widths of each region +gr.width + +# access metadata columns, everything other than genomic locations +print(gr.mcols) +``` + +### Setters + +::: {.callout-important} +All property-based setters are `in_place` operations. Methods are available to get and set properties on `GenomicRanges`. +::: + +```{python} +gr.mcols = gr.mcols.set_column("score", range(1,6)) + +# or use an in-place operation +gr.mcols.set_column("score", range(1,6), in_place=True) + +print(gr.mcols) +``` + +### Access ranges + +The `ranges` property (or `get_ranges()`) accesses only the genomic coordinates: + +```{python} +# or gr.get_ranges() + +print(gr.ranges) +``` + +## Subset operations + +You can subset a `GenomicRanges` object using the subset (`[]`) operator. This operation accepts different slice input types, such as a boolean vector, a `slice` object, a list of indices, or row/column names to subset. + +```{python} +# slice the first 3 rows +gr[:3] + +# slice 1, 3 and 2nd rows +print(gr[[1,3,2]]) +``` + +## Iterate over intervals + +You can iterate over the intervals of a `GenomicRanges` object. `rowname` is `None` if the object does not contain any row names. 
+ +```{python} +for rowname, row in gr[:2]: + print(rowname, row) +``` + +## Intra-range transformations + +For a detailed description of these methods, refer to Bioconductor's [GenomicRanges documentation](https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf). + +- **flank**: Flanks the intervals based on **start** or **end** or **both**. +- **shift**: Shifts all the ranges by the amount specified by the **shift** argument. +- **resize**: Resizes the ranges to the specified width where either the **start**, **end**, or **center** is used as an anchor. +- **narrow**: Narrows the ranges. +- **promoters**: Generates promoter ranges for each range relative to the transcription start site (TSS). The promoter range is expanded around the TSS according to the upstream and downstream parameters. +- **restrict**: Restricts the ranges to the interval(s) specified by the start and end arguments. +- **trim**: Trims out-of-bound ranges located on non-circular sequences whose length is not NA. + +```{python} +gr = GenomicRanges( + seqnames=[ + "chr1", + "chr2", + "chr3", + "chr2", + "chr3", + ], + ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]), + strand=["*", "-", "*", "+", "-"], + mcols=BiocFrame( + { + "score": range(0, 5), + "GC": [random() for _ in range(5)], + } + ), +) + +# flank +flanked_gr = gr.flank(width=10, start=False, both=True) + +# shift +shifted_gr = gr.shift(shift=10) + +# resize +resized_gr = gr.resize(width=10, fix="end", ignore_strand=True) + +# narrow +narrow_gr = gr.narrow(end=1, width=1) + +# promoters +prom_gr = gr.promoters() + +# restrict +restrict_gr = gr.restrict(start=114, end=140, keep_all_ranges=True) + +# trim +trimmed_gr = gr.trim() +``` + +## Inter-range methods + +- **range**: Returns a new `GenomicRanges` object containing range bounds for each distinct (seqname, strand) pair. +- **reduce**: Returns a new `GenomicRanges` object containing reduced bounds for each distinct (seqname, strand) pair. 
+- **gaps**: Finds gaps in the `GenomicRanges` object for each distinct (seqname, strand) pair. +- **disjoin**: Finds disjoint intervals across all locations for each distinct (seqname, strand) pair. + +```{python} +gr = GenomicRanges( + seqnames=[ + "chr1", + "chr2", + "chr3", + "chr2", + "chr3", + ], + ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]), + strand=["*", "-", "*", "+", "-"], + mcols=BiocFrame( + { + "score": range(0, 5), + "GC": [random() for _ in range(5)], + } + ), +) + +# range +range_gr = gr.range() + +# reduce +# reduced_gr = gr.reduce(min_gap_width=3, with_reverse_map=True) + +# gaps +gapped_gr = gr.gaps(start=103) # OR +gapped_gr = gr.gaps(end={"chr1": 120, "chr2": 120, "chr3": 120}) + +# disjoin +disjoin_gr = gr.disjoin() + +print(disjoin_gr) +``` + +## Set operations on genomic ranges + +- **union**: Compute the `union` of intervals across object. +- **intersect**: Compute the `intersection` or finds overlapping intervals. +- **setdiff**: Compute `set` `difference`. 
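
As a rough illustration of what the three set operations listed above compute, here is a minimal plain-Python sketch on lists of 1-based, inclusive `(start, end)` tuples. It ignores sequence names and strand and does not use the `GenomicRanges` API; the helper names `merge`, `union`, `intersect` and `setdiff` are hypothetical, and the library's own output may be organised differently.

```python
def merge(intervals):
    """Merge overlapping or adjacent inclusive intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(x, y):
    """Positions covered by x or y."""
    return merge(x + y)


def intersect(x, y):
    """Positions covered by both x and y."""
    out = []
    for xs, xe in merge(x):
        for ys, ye in merge(y):
            lo, hi = max(xs, ys), min(xe, ye)
            if lo <= hi:
                out.append((lo, hi))
    return out


def setdiff(x, y):
    """Positions covered by x but not by y."""
    out = []
    for xs, xe in merge(x):
        cursor = xs  # first position of x not yet accounted for
        for ys, ye in merge(y):
            if ye < cursor or ys > xe:
                continue  # no overlap with the remaining part of x
            if ys > cursor:
                out.append((cursor, ys - 1))
            cursor = max(cursor, ye + 1)
        if cursor <= xe:
            out.append((cursor, xe))
    return out


x = [(1, 10), (20, 30)]
y = [(5, 25)]
print(union(x, y))
print(intersect(x, y))
print(setdiff(x, y))
```

The sketch merges adjacent intervals before comparing them, which keeps each result a sorted list of non-overlapping intervals.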
+ +```{python} +g_src = GenomicRanges( + seqnames = ["chr1", "chr2", "chr1", "chr3", "chr2"], + ranges = IRanges(start =[101, 102, 103, 104, 109], width=[112, 103, 128, 134, 111]), + strand = ["*", "-", "*", "+", "-"] +) + +g_tgt = GenomicRanges( + seqnames = ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"], + ranges = IRanges(start =range(101, 111), width=range(121, 131)), + strand = ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"] +) +``` + +```{python} +# intersection +int_gr = g_src.intersect(g_tgt) + +# set diff +diff_gr = g_src.setdiff(g_tgt) + +# union +union_gr = g_src.union(g_tgt) + +print(union_gr) +``` + +## Compute over bins + +### Summary stats for column + +Use Pandas for computing summary statistics for a column: +```{python} +pd.Series(gr.mcols.get_column("score")).describe() +``` + +### `binned_average` + +Compute binned average for different positions: + +```{python} +bins = pd.DataFrame({"seqnames": ["chr1"], "starts": [101], "ends": [109],}) + +bins_gr = GenomicRanges.from_pandas(bins) + +subject = GenomicRanges( + seqnames= ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"], + ranges=IRanges(range(101, 111), range(121, 131)), + strand= ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"], + mcols=BiocFrame({ + "score": range(0, 10), + }) +) + +# Compute binned average +binned_avg_gr = subject.binned_average(bins=bins_gr, scorename="score", outname="binned_score") +print(binned_avg_gr) +``` + +::: {.callout-tip} +Now you might wonder how can I generate these ***bins***? +::: + +### Generate tiles or bins from `GenomicRanges` + +- **tile**: Splits each genomic region by **n** (number of regions) or by **width** (maximum width of each tile). +- **sliding_windows**: Generates sliding windows within each range, by **width** and **step**. 
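
To make the difference between the two methods listed above concrete, here is a minimal plain-Python sketch on a single 1-based, inclusive interval. It does not use the `GenomicRanges` API; the helper names `tile_interval` and `sliding_windows` are hypothetical, and the library's exact boundary handling may differ.

```python
def tile_interval(start, end, n):
    """Split the inclusive interval [start, end] into n near-equal tiles."""
    width = end - start + 1
    if n < 1 or n > width:
        raise ValueError("n must be between 1 and the interval width")
    tiles = []
    for i in range(n):
        # Integer arithmetic spreads any remainder across the tiles.
        lo = start + (i * width) // n
        hi = start + ((i + 1) * width) // n - 1
        tiles.append((lo, hi))
    return tiles


def sliding_windows(start, end, width, step=1):
    """Fixed-width windows, advancing by step, that fit inside [start, end]."""
    if width < 1 or step < 1:
        raise ValueError("width and step must be positive")
    return [(s, s + width - 1) for s in range(start, end - width + 2, step)]


print(tile_interval(101, 111, n=2))
print(sliding_windows(101, 111, width=10))
```

Tiles partition the interval without overlap, whereas consecutive sliding windows overlap whenever the step is smaller than the width.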
+ +```{python} +gr = GenomicRanges( + seqnames=[ + "chr1", + "chr2", + "chr3", + "chr2", + "chr3", + ], + ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]), + strand=["*", "-", "*", "+", "-"], + mcols=BiocFrame( + { + "score": range(0, 5), + "GC": [random() for _ in range(5)], + } + ), +) + +# tiles +tiles = gr.tile(n=2) + +# slidingwindows +tiles = gr.sliding_windows(width=10) +print(tiles) +``` + +### Generate tiles from Genome + +`tile_genome` returns a set of genomic regions that form a partitioning of the specified genome. + +```{python} +seqlengths = {"chr1": 100, "chr2": 75, "chr3": 200} + +tiles = GenomicRanges.tile_genome(seqlengths=seqlengths, n=10) +print(tiles) +``` + +### Coverage + +Computes number of ranges that overlap for each position in the range. + +```{python} +import rich + +res_vector = gr.coverage(shift=10, width=5) +rich.print(res_vector) +``` + +## Overlap based methods + +- **find_overlaps**: Find overlaps between two `GenomicRanges` objects. +- **count_overlaps**: Count overlaps between two `GenomicRanges` objects. +- **subset_by_overlaps**: Subset a `GenomicRanges` object if it overlaps with the ranges in the query. + +```{python} +subject = GenomicRanges( + seqnames= ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"], + ranges=IRanges(range(101, 111), range(121, 131)), + strand= ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"], + mcols=BiocFrame({ + "score": range(0, 10), + }) +) + +df_query = pd.DataFrame( + {"seqnames": ["chr2",], "starts": [4], "ends": [6], "strand": ["+"]} +) + +query = GenomicRanges.from_pandas(df_query) + +# find Overlaps +res = subject.find_overlaps(query, query_type="within") + +# count Overlaps +res = subject.count_overlaps(query) + +# subset by Overlaps +res = subject.subset_by_overlaps(query) + +print(res) +``` + +## Search operations + +- **nearest**: Performs nearest neighbor search along any direction (both upstream and downstream). 
+- **follow**: Performs nearest neighbor search only in the downstream direction. +- **precede**: Performs nearest neighbor search only in the upstream direction. + +```{python} +find_regions = GenomicRanges( + seqnames= ["chr1", "chr2", "chr3"], + ranges=IRanges([200, 105, 1190],[203, 106, 1200]), +) + +query_hits = gr.nearest(find_regions) + +query_hits = gr.precede(find_regions) + +query_hits = gr.follow(find_regions) + +print(query_hits) +``` + +::: {.callout-note} +Similar to `IRanges` operations, these methods typically return a list of indices from `subject` for each interval in `query`. +::: + +## Comparison, rank and order operations + +- **match**: Element-wise comparison to find exact match intervals. +- **order**: Get the order of indices for sorting. +- **sort**: Sort the `GenomicRanges` object. +- **rank**: For each interval, identifies its position in the sorted order. + +```{python} +# match +query_hits = gr.match(gr[2:5]) +print("matches: ", query_hits) + +# order +order = gr.order() +print("order:", order) + +# sort +sorted_gr = gr.sort() +print("sorted:", sorted_gr) + +# rank +rank = gr.rank() +print("rank:", rank) +``` + +## Combine `GenomicRanges` objects by rows + +Use the `combine` generic from [biocutils](https://github.com/BiocPy/generics) to concatenate multiple `GenomicRanges` objects. 
+ +```{python} +from biocutils.combine import combine +a = GenomicRanges( + seqnames=["chr1", "chr2", "chr1", "chr3"], + ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]), + strand=["-", "+", "*", "+"], + mcols=BiocFrame({"score": [1, 2, 3, 4]}), +) + +b = GenomicRanges( + seqnames=["chr2", "chr4", "chr5"], + ranges=IRanges([3, 6, 4], [30, 50, 60]), + strand=["-", "+", "*"], + mcols=BiocFrame({"score": [2, 3, 4]}), +) + +combined = combine(a,b) +print(combined) +``` + +## Misc operations + +- **invert_strand**: flip the strand for each interval +- **sample**: randomly choose ***k*** intervals + +```{python} +# invert strand +inv_gr = gr.invert_strand() + +# sample +samp_gr = gr.sample(k=4) +``` + +## Construct a `GenomicRangesList` object. + +Just as it sounds, a `GenomicRangesList` is a named-list like object. + +If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple +genomic elements, usually where the genes start and end. Genes are themselves made of many sub +regions, e.g. exons. `GenomicRangesList` allows us to represent this nested structure. + +Currently, this class is limited in functionality, purely a read-only class with basic accessors. 
+ +***Note: This is a work in progress and the functionality is limited.*** + +```{python} +from genomicranges import GenomicRangesList +a = GenomicRanges( + seqnames=["chr1", "chr2", "chr1", "chr3"], + ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]), + strand=["-", "+", "*", "+"], + mcols=BiocFrame({"score": [1, 2, 3, 4]}), +) + +b = GenomicRanges( + seqnames=["chr2", "chr4", "chr5"], + ranges=IRanges([3, 6, 4], [30, 50, 60]), + strand=["-", "+", "*"], + mcols=BiocFrame({"score": [2, 3, 4]}), +) + +grl = GenomicRangesList(ranges=[a,b], names=["gene1", "gene2"]) +print(grl) +``` + + +## Properties + +```{python} +grl.start +grl.width +``` + +## Combine `GenomicRangesList` objects + +Similar to `GenomicRanges`, use the `combine` generic to concatenate `GenomicRangesList` objects: + +```{python} +grla = GenomicRangesList(ranges=[a], names=["a"]) +grlb = GenomicRangesList(ranges=[b, a], names=["b", "c"]) + +# or use the combine generic +from biocutils.combine import combine +cgrl = combine(grla, grlb) +``` + +And that's all for now! Check back later for more updates. \ No newline at end of file diff --git a/chapters/representations/iranges.qmd b/chapters/representations/iranges.qmd new file mode 100644 index 0000000..7ccf0dd --- /dev/null +++ b/chapters/representations/iranges.qmd @@ -0,0 +1,177 @@ +# `IRanges`: Interval arithmetic + +A Python implementation of the [**IRanges**](https://bioconductor.org/packages/IRanges) Bioconductor package. + + +## Installation +To get started, install the package from [PyPI](https://pypi.org/project/IRanges/): + +```bash +pip install iranges +``` + +::: {.callout-note} +The descriptions for some of these methods come from the [Bioconductor documentation](https://bioconductor.org/packages/release/bioc/html/IRanges.html). +::: + +## Construction + +An `IRanges` object holds a **start** position and a **width**, and is most typically used to represent coordinates along some genomic sequence. 
The interpretation of the start position depends on the application; for sequences, the start is usually a 1-based position, but other use cases may allow zero or even negative values (e.g. circular genomes). + +```{python} +from iranges import IRanges + +starts = [-2, 6, 9, -4, 1, 0, -6, 10] +widths = [5, 0, 6, 1, 4, 3, 2, 3] +ir = IRanges(starts, widths) + +print(ir) +``` + +## Accessing properties + +Properties can be accessed directly from the object: + +```{python} +print("Number of intervals:", len(ir)) + +print("start positions:", ir.get_start()) +print("width of each interval:", ir.get_width()) +print("end positions:", ir.get_end()) +``` + +::: {.callout-tip} +Just like `BiocFrame`, these classes offer both functional-style and property-based getters and setters. +::: + +```{python} +print("start positions:", ir.start) +print("width of each interval:", ir.width) +print("end positions:", ir.end) +``` + +## Reduced ranges (Normality) + +The `reduce` method reduces the intervals to an `IRanges` where the intervals are: + +- not empty +- not overlapping +- ordered from left to right +- not even adjacent (i.e. there must be a non-empty gap between two consecutive ranges). + +```{python} +reduced = ir.reduce() +print(reduced) +``` + +## Overlap operations + +`IRanges` uses [nested containment lists](https://github.com/pyranges/ncls) under the hood to perform fast overlap and search-based operations. + +```{python} +subject = IRanges([2, 2, 10], [1, 2, 3]) +query = IRanges([1, 4, 9], [5, 4, 2]) + +overlap = subject.find_overlaps(query) +print(overlap) +``` + +### Finding neighboring ranges + +The `nearest`, `precede` and `follow` methods find the nearest range along the specified direction. 
+ +```{python} +query = IRanges([1, 3, 9], [2, 5, 2]) +subject = IRanges([3, 5, 12], [1, 2, 1]) + +nearest = subject.nearest(query, select="all") +print(nearest) +``` + +::: {.callout-note} +These methods typically return a list of indices from `subject` for each interval in `query`. +::: + +### coverage + +The `coverage` method counts the number of overlaps for each position. + +```{python} +cov = subject.coverage() +print(cov) +``` + + +## Transforming ranges + +`shift` adjusts the start positions by their **shift**. + +```{python} +shifted = ir.shift(shift=10) +print(shifted) +``` + +Other range transformation methods include `narrow`, `resize`, `flank`, `reflect` and `restrict`. For example `narrow` supports the adjustment of `start`, `end` and `width` values, which should be relative to each range. + +```{python} +narrowed = ir.narrow(start=4, width=2) +print(narrowed) +``` + +### Disjoin intervals + +Well as the name says, computes disjoint intervals. + +```{python} +disjoint = ir.disjoin() +print(disjoint) +``` + +### `reflect` and `flank` + +`reflect` reverses each range within a set of common reference bounds. + +```{python} +starts = [2, 5, 1] +widths = [2, 3, 3] +x = IRanges(starts, widths) +bounds = IRanges([0, 5, 3], [11, 2, 7]) + +res = x.reflect(bounds=bounds) +print(res) +``` + +`flank` returns ranges of a specified width that flank, to the left (default) or right, each input range. One use case of this is forming promoter regions for a set of genes. + +```{python} +starts = [2, 5, 1] +widths = [2, 3, 3] +x = IRanges(starts, widths) + +res = x.flank(2, start=False) +print(res) +``` + +## Set operations + +`IRanges` supports most interval set operations. 
For example, to compute `gaps`: + +```{python} +gaps = ir.gaps() +print(gaps) +``` + +Or perform interval set operations, e.g. `union`, `intersect`, `disjoin`: + +```{python} +x = IRanges([1, 5, -2, 0, 14], [10, 5, 6, 12, 4]) +y = IRanges([14, 0, -5, 6, 18], [7, 3, 8, 3, 3]) + +intersection = x.intersect(y) +print(intersection) +``` + +## Further reading + +- [IRanges reference](https://biocpy.github.io/IRanges/api/iranges.html#iranges-package) +- [Bioc/IRanges](https://bioconductor.org/packages/release/bioc/html/IRanges.html)