Commit

Add sections on IRanges and GenomicRanges (#2)
jkanche authored Jan 5, 2024
1 parent e957081 commit 273dacb
Showing 5 changed files with 763 additions and 30 deletions.
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -21,6 +21,8 @@ book:
- part: chapters/representations/index.qmd
  chapters:
    - chapters/representations/biocframe.qmd
    - chapters/representations/iranges.qmd
    - chapters/representations/genomicranges.qmd
- chapters/summary.qmd
- chapters/references.qmd

53 changes: 23 additions & 30 deletions chapters/representations/biocframe.qmd
@@ -1,20 +1,20 @@
# `BiocFrame` - Bioconductor-like data frames

This package implements the BiocFrame class, a Bioconductor-friendly alternative to Pandas DataFrame. The main advantage is that the BiocFrame makes no assumption on the types of the columns - as long as an object has a length (`__len__`) and slicing methods (`__getitem__`), it can be used inside a `BiocFrame`.
The `BiocFrame` class is a Bioconductor-friendly alternative to the Pandas `DataFrame`. Its key advantage is that it makes no assumptions about the types of the columns: as long as an object has a length (`__len__`) and supports slicing (`__getitem__`), it can be used inside a `BiocFrame`.

This allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects.
This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects.
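
To make the column requirement concrete, here is a minimal sketch of a custom column type that exposes only a length and item access. `GeneIds` is a hypothetical class invented for illustration, and the sketch deliberately does not import `biocframe`; whether `BiocFrame` applies further checks to custom columns is not shown in this chapter.

```python
class GeneIds:
    """Hypothetical column type: a thin wrapper around a list of IDs."""

    def __init__(self, ids):
        self._ids = list(ids)

    def __len__(self):
        # Length is what lets a frame compare the column to its row count.
        return len(self._ids)

    def __getitem__(self, index):
        # A slice returns a new GeneIds; an integer returns a single ID.
        if isinstance(index, slice):
            return GeneIds(self._ids[index])
        return self._ids[index]


ids = GeneIds(["ENS00001", "ENS00002", "ENS00003"])
print(len(ids))       # number of "rows" this column would contribute
print(ids[0])         # a single element
print(len(ids[1:3]))  # slicing keeps the column type
```

Any object shaped like this satisfies the two requirements named above, which is why nested frames and other complex containers can plausibly serve as columns.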

## Installation

Package is published to [PyPI](https://pypi.org/project/biocframe/)
To get started, install the package from [PyPI](https://pypi.org/project/biocframe/):

```bash
pip install biocframe
```

## Construction

To construct a `BiocFrame` object, simply provide the data as a dictionary.
To create a `BiocFrame` object, simply provide the data as a dictionary.

```{python}
from biocframe import BiocFrame
@@ -29,7 +29,7 @@ print(bframe)

::: {.callout-tip}
You can specify complex objects as columns, as long as they have some "length" equal to the number of rows.
For example, we can nest a `BiocFrame` inside another `BiocFrame`:
For example, we can embed a `BiocFrame` within another `BiocFrame`:
:::


@@ -50,22 +50,18 @@ print(bframe2)

## Extracting data

Properties can be accessed directly from the object:
Properties can be directly accessed from the object:

```{python}
print("shape:", bframe.shape)
print("column names (functional style):", bframe.get_column_names())
print("column names (as property):", bframe.column_names) # same as above
```

We can fetch individual columns:

```{python}
print("functional style:", bframe.get_column("ensembl"))
print("w/ accessor", bframe["ensembl"])
```

@@ -75,9 +71,9 @@ And we can get individual rows as a dictionary:
bframe.get_row(2)
```

::: {.callout-important}
To extract a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator.
This accepts different subsetting arguments like a boolean vector, a `slice` object, a sequence of indices, or row/column names.
::: {.callout}
To retrieve a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator.
This operator accepts different subsetting arguments, such as a boolean vector, a `slice` object, a sequence of indices, or row/column names.
:::

```{python}
@@ -95,8 +91,7 @@ print("\nShort-hand to get a single column: \n", bframe["ensembl"])

### Preferred approach

To set `BiocFrame` properties, we encourage a **functional style** of programming that avoids mutating the object.
This avoids inadvertent modification of `BiocFrame`s that are part of larger data structures.
To set `BiocFrame` properties, we encourage a **functional style** of programming that avoids mutating the object. This avoids inadvertent modification of `BiocFrame` instances within larger data structures.

```{python}
modified = bframe.set_column_names(["column1", "column2"])
@@ -122,17 +117,18 @@ print(modified)

Change the row or column names:

::: {.callout-note}
The functional style allows you to chain multiple operations as in the example below.
:::

```{python}
modified = bframe.\
    set_column_names(["FOO", "BAR"]).\
    set_row_names(['alpha', 'bravo', 'charlie'])
print(modified)
```


::: {.callout-tip}
The functional style allows you to chain multiple operations.
:::
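
Chaining works because each setter returns a new object rather than modifying the receiver. The sketch below illustrates that idea with a toy `TinyFrame` class; it is a hypothetical stand-in, not `BiocFrame`'s implementation, and only the `in_place` argument name is borrowed from this chapter.

```python
import copy


class TinyFrame:
    """Toy stand-in for a frame; it only tracks row and column names."""

    def __init__(self, column_names, row_names=None):
        self.column_names = list(column_names)
        self.row_names = None if row_names is None else list(row_names)

    def _target(self, in_place):
        # Return self for in-place edits, otherwise a shallow copy.
        return self if in_place else copy.copy(self)

    def set_column_names(self, names, in_place=False):
        out = self._target(in_place)
        out.column_names = list(names)
        return out

    def set_row_names(self, names, in_place=False):
        out = self._target(in_place)
        out.row_names = list(names)
        return out


original = TinyFrame(["ensembl", "symbol"])
modified = (
    original
    .set_column_names(["FOO", "BAR"])
    .set_row_names(["alpha", "bravo", "charlie"])
)

print(original.column_names)  # the original is left untouched
print(modified.column_names, modified.row_names)
```

In this toy class, passing `in_place=True` returns the same object, which is why in-place edits can affect every structure that holds a reference to it.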

We also support Bioconductor's metadata concepts, either along the columns or for the entire object:

```{python}
@@ -144,8 +140,7 @@ print(modified)

### The other way

Properties can also be set by direct assignment for in-place modification.
We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures.
Properties can also be set by direct assignment for in-place modification. We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures.
Nonetheless:

```{python}
@@ -154,7 +149,7 @@ testframe.column_names = ["column1", "column2" ]
print(testframe)
```

::: {.callout-caution}
::: {.callout-important}
Warnings are raised when properties are directly mutated. These assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`.
It is best to do this only if the `BiocFrame` object is not being used anywhere else;
otherwise, it is safer to just create a (shallow) copy via the default `in_place = False`.
@@ -169,8 +164,7 @@ testframe[1:3, ["column1","column2"]] = BiocFrame({"x":[4, 5], "y":["E", "F"]})

## Combining objects

**BiocFrame** implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).
So, for example, to combine by row:
`BiocFrame` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils). For example, to combine by row:

```{python}
import biocutils
@@ -216,7 +210,7 @@ combined = biocutils.relaxed_combine_rows(bframe1, modified2)
print(combined)
```

Similarly, if the rows are different, we can use **BiocFrame**'s `merge` function:
Similarly, if the rows are different, we can use `BiocFrame`'s `merge` function:

```{python}
from biocframe import merge
@@ -230,9 +224,7 @@ print(combined)

## Interop with pandas

`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R.
Most users will probably prefer to work with **pandas** `DataFrame` objects for their actual analyses.
This conversion is easily achieved:
`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R. Most users will probably prefer to work with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved:

```{python}
from biocframe import BiocFrame
@@ -256,14 +248,14 @@ print(out)

## Empty Frames

We can create empty `BiocFrame` objects that hold no information except the number of rows. This is useful when `BiocFrame` objects are part of larger datastructures but hold no data.
We can create empty `BiocFrame` objects that only specify the number of rows. This proves beneficial in situations where `BiocFrame` objects are integrated into more extensive data structures but do not possess any data themselves.

```{python}
empty = BiocFrame(number_of_rows=100)
print(empty)
```

Most operations described in the document can also be performed on an empty `BiocFrame` object.
Most operations described in this document can be performed on an empty `BiocFrame` object.

```{python}
print("Column names:", empty.column_names)
@@ -275,6 +267,7 @@ print("\nSubsetting an empty BiocFrame: \n", subset_empty)
::: {.callout-tip}
Similarly one can create an empty `BiocFrame` with only row names.
:::

## Further reading

Check out [the reference documentation](https://biocpy.github.io/BiocFrame/) for more details.
1 change: 1 addition & 0 deletions chapters/representations/genomicpacks.qmd
@@ -0,0 +1 @@
## Genomic analysis
