[WISE FR]: More stat's in multi-band TIF files #131

Open
2 of 4 tasks
RobBryce opened this issue Sep 19, 2022 · 18 comments
Assignees
Labels: Approved, Attribution Required, feature, Outside Contribution

Comments

@RobBryce
Collaborator

RobBryce commented Sep 19, 2022

Contact Details

rbryce@heartlandsoftware.ca

What is needed?

Currently, multi-band TIF files contain:

  • accumulated band
  • count band
  • mean band (accumulated / count)
  • minimum value band
  • maximum value band

The request here is to add bands for standard deviation and standard error, to improve post-processing analysis.
There will be no cost to the WISE team for this work; it will be a contribution.
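
For reference, the relationship between the existing bands and the two requested ones can be written out as below. This is a sketch under one assumption: the thread does not say whether the sample (n - 1) or population (n) form of the standard deviation is used, so the sample form is shown. Here n is the count band and the sum of the x_i is the accumulated band.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
\mathrm{SE} = \frac{s}{\sqrt{n}}
```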

How will this improve the project or tool?

More complete statistics available to post-analysis tools.

TODO

@RobBryce added the Triage and feature labels on Sep 19, 2022
@BadgerOnABike
Collaborator

I'll continue to rail that we should be including the median. As with many of these variables, the mean is relatively meaningless because the data are not normally distributed.

@RobBryce
Collaborator Author

@BadgerOnABike Summary of our private discussion:
The concern with providing a median from within the WISE executable is the memory required to calculate it. Specifically, if there are 100 sub-scenarios to combine, then we need to store 100 layers of sub-scenario results for each output we want a median for. That is a lot of memory: tens of GB of RAM.
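
A rough back-of-envelope estimate shows how the storage adds up. The grid size and the 4-byte value width below are hypothetical, chosen only for illustration; they are not WISE figures.

```python
def median_buffer_bytes(rows, cols, sub_scenarios, outputs=1, bytes_per_value=4):
    """Bytes needed to keep every sub-scenario layer for an exact per-cell median."""
    return rows * cols * sub_scenarios * outputs * bytes_per_value


# Hypothetical 10,000 x 10,000 grid, 100 sub-scenarios, one 4-byte output.
total = median_buffer_bytes(10_000, 10_000, 100)
print(f"{total / 1e9:.0f} GB")  # prints "40 GB"
```

Each additional output that needs a median multiplies this figure again.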

In addition, we have tools to combine these multi-layer grids later on, so that scenarios (not just sub-scenarios) can be combined in meaningful ways. At this point, these multi-band TIFs do not store any individual sub-scenario outputs, just the results of the combinations. To combine many multi-layer grids and then compute a real median (not a median of medians), all sub-scenario outputs would need to be stored for later re-analysis. That would produce very large multi-band TIF files to recombine. However, it would provide all the data needed to look at each sub-scenario individually (which somewhat defeats the purpose of a sub-scenario).
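
A small illustration of why a median of medians is not a real median. The numbers are made up and stand in for one cell's values across three separately combined grids.

```python
from statistics import median

# Made-up per-cell values from three hypothetical grids.
groups = [[1, 2, 100], [3, 4, 5], [6, 7, 8]]

# Median of each grid's median: medians are 2, 4, 7.
median_of_medians = median(median(g) for g in groups)

# True median needs every underlying value.
true_median = median(v for g in groups for v in g)

print(median_of_medians, true_median)  # prints "4 5"
```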

@BadgerOnABike
Collaborator

Perhaps we need to pore over the Burn-P3 code, because it lets us acquire any percentile we desire and doesn't consume all RAM until many hundreds of thousands of runs are being completed. Additionally, if we aren't providing a full suite of statistics with the sub-scenarios, their utility arguably goes down, as we cannot determine anything about the distribution from the mean when that statistic's normality assumption isn't met. That is, of course, assuming that BP3 is calculating it correctly and isn't obtaining it some other way.

@RobBryce
Collaborator Author

From memory (since I have audited that code in the past, and the code may have been updated):

BurnP3 uses a variable-sized (auto-sized) array per grid cell, so it loses track of which fire provided which value (which isn't important for median calculations). This potentially reduces overall memory usage, but at the expense of more overhead per cell, slower insertions of data, and a whole lot of memory fragmentation. It is a viable in-memory approach, though, particularly if fires are small relative to the overall dimensions of your plot. I also recall reporting an issue where the median for RAZ was not being calculated correctly (it was treated as linear data rather than circular), but I don't recall if you care about median RAZ.
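
To illustrate the linear-versus-circular point for a direction such as RAZ, here is a minimal sketch. It uses one common definition of a circular median (the sample value that minimises the summed arc distance to all other values). It is not taken from BurnP3 or WISE, and the angles are made up.

```python
from statistics import median


def circular_distance(a, b):
    """Shortest arc between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def circular_median(angles):
    """Sample angle minimising the total arc distance to all samples."""
    return min(angles, key=lambda c: sum(circular_distance(c, a) for a in angles))


# Made-up directions clustered around north (0/360 degrees).
angles = [300, 330, 350, 10, 20]

print(median(angles))           # linear median: 300
print(circular_median(angles))  # circular median: 350
```

The linear median sorts 10 and 20 below 300, so it lands on the edge of the cluster rather than its middle.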

WISE doesn't limit the stats to the closed set that BurnP3 does. And the BurnP3 approach would need a non-standard file format to export this data in order to calculate medians of combined datasets.

You don't need to retain the complete dataset to calculate any of mean, standard deviation, or standard error.
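
One way to see this: a (count, mean, sum of squared deviations) summary per cell is enough, and two such summaries can be merged without the raw values using the standard pairwise update. The sketch below is illustrative only; the `Summary` type and function names are hypothetical, and the thread does not confirm that the WISE combination tools work this way.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Summary:
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def summarize(values):
    """Build a summary from raw values (done once, per batch)."""
    n = len(values)
    mean = sum(values) / n
    return Summary(n, mean, sum((v - mean) ** 2 for v in values))


def merge(a, b):
    """Combine two summaries without access to the underlying values."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return Summary(n, mean, m2)


def sample_std(s):
    """Sample standard deviation (n - 1 denominator) from a summary."""
    return math.sqrt(s.m2 / (s.count - 1))


# Made-up values for one cell from two separately processed batches.
batch_a = [1.0, 2.0, 3.0, 4.0]
batch_b = [10.0, 20.0, 30.0]

combined = merge(summarize(batch_a), summarize(batch_b))
reference = statistics.stdev(batch_a + batch_b)
print(math.isclose(sample_std(combined), reference))  # prints "True"
```

The standard error then follows as `sample_std(combined) / math.sqrt(combined.count)`.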

@RobBryce
Collaborator Author

RobBryce commented Nov 30, 2022

I'm ready to merge this (standard deviation and standard error) back in, ready for evaluation. Alberta Parks contribution.

@RobBryce added the Needs Approval and Outside Contribution labels on Nov 30, 2022
@spydmobile added the Attribution Required label and removed the Triage label on Nov 30, 2022
@spydmobile added the Approved label and removed the Needs Approval label on Dec 9, 2022
@spydmobile
Collaborator

@RobBryce what is the status of this work? Is it complete?

@RobBryce assigned spydmobile and unassigned RobBryce on Feb 13, 2023
@RobBryce
Collaborator Author

Standard deviation and standard error stats were added and received a lot of testing. Outside validation wouldn't hurt once others are generating sub-scenarios.
No work on median values has been performed, since that wasn't part of the original ticket text and the budget for this work was limited to std dev and std err.

@BadgerOnABike
Collaborator

Am I correct in thinking this applies when fires are burned iteratively and we calculate the per-pixel mean across a range of weather parameters, or are the mean / sd / se coming from somewhere else? I'm unclear as to how I would test these metrics, though I am interested in doing so.

@RobBryce
Collaborator Author

The output of a scenario TIF file (with sub-scenarios) is a multi-band TIF. We only added a few more bands for sd/se. Existing bands were listed above.
An export from a regular scenario is single-band. An export from a scenario with even one sub-scenario is multi-band.
Sub-scenarios may have different weather, or other different parameters too. But for our work, it is typically iterating through weather.

@BadgerOnABike
Collaborator

I get that part; I'm curious what is being averaged here. My understanding is that it's multiple scenarios / sub-scenarios. Is that correct?

@RobBryce
Collaborator Author

Yes, for whatever stat is requested

@BadgerOnABike
Collaborator

Alright, then I'd be able to replicate this fairly easily. I presume that for the mean you're simply accumulating a sum and dividing by the number of scenarios at the end.

For standard deviation you would need all the layers in order to subtract each from the mean. Wouldn't we then already hold the same data required for the median?

@RobBryce
Collaborator Author

RobBryce commented Feb 13, 2023

We are using Welford's method, identified at https://stackoverflow.com/questions/895929/how-do-i-determine-the-standard-deviation-stddev-of-a-set-of-values, which also links to https://www.johndcook.com/blog/standard_deviation/. We don't need to store the complete dataset for these stats. This way, the change in memory consumption is known in advance, whereas if we stored all data from all simulations, we could not necessarily predict memory consumption.
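
A minimal sketch of Welford's update for a single cell follows. The class and method names are hypothetical and this is not the WISE source. It also assumes the sample (n - 1) form of the standard deviation, which the thread does not specify. Only three numbers are kept per cell, regardless of how many sub-scenarios are pushed.

```python
import math


class RunningStats:
    """Streaming mean / standard deviation / standard error for one cell."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one new value into the running summary (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std_dev(self):
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))

    def std_err(self):
        """Standard error of the mean; 0.0 when no values have been seen."""
        if self.count == 0:
            return 0.0
        return self.std_dev() / math.sqrt(self.count)


# Made-up values standing in for one cell across eight sub-scenarios.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningStats()
for value in data:
    stats.push(value)

print(stats.count, stats.mean, stats.std_dev(), stats.std_err())
```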

@BadgerOnABike
Collaborator

Interesting, I do see some methods to calculate a rolling median as well. I'll continue my search, until then I think what we have should work. I guess I'll find out when I go to test them again!
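
For context, one common exact approach to a running median is the two-heap method, sketched below. This is not necessarily one of the methods referred to above, and the class name is hypothetical. Note that it still retains every value, so it does not remove the memory concern raised earlier in the thread.

```python
import heapq


class RunningMedian:
    """Exact running median using two heaps; keeps all pushed values."""

    def __init__(self):
        self._low = []   # max-heap of the lower half (stored negated)
        self._high = []  # min-heap of the upper half

    def push(self, x):
        if not self._low or x <= -self._low[0]:
            heapq.heappush(self._low, -x)
        else:
            heapq.heappush(self._high, x)
        # Rebalance so the lower half holds the same number of items
        # as the upper half, or exactly one more.
        if len(self._low) > len(self._high) + 1:
            heapq.heappush(self._high, -heapq.heappop(self._low))
        elif len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        """Median of everything pushed so far (requires at least one value)."""
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2


rm = RunningMedian()
for value in [5, 1, 9, 3]:
    rm.push(value)
even_median = rm.median()  # values 1, 3, 5, 9

rm.push(7)
odd_median = rm.median()   # values 1, 3, 5, 7, 9

print(even_median, odd_median)  # prints "4.0 5"
```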

@spydmobile assigned RobBryce and unassigned spydmobile on Feb 16, 2023
@spydmobile
Collaborator

@RobBryce is the original work (not the median) completed? If so, is it ready for testing? If so, assign it to @BadgerOnABike and label it "Needs Testing". Otherwise this is outstanding, as this was a contribution. Also, this will need some kind of attribution, which we need to resolve before we can close this.

@RobBryce
Collaborator Author

The original work has been used for a while now. Once @BadgerOnABike can run projects, he can validate. Or, we can provide outputs. Either way we had to validate it some months ago. @lizzydchappy can provide specifics, but I believe attribution should go to Alberta Parks.

@RobBryce assigned BadgerOnABike and unassigned RobBryce on Feb 17, 2023
@BadgerOnABike
Collaborator

Answer to the question of "Does the median matter in HFI"

TLDR: Yes

I summarised multiple decades of Alberta fire weather history in three ways: everything, everything in June, and everything in June at station C3. All three show the same general trend: massively zero-inflated data yielding a negative exponential distribution. This will matter most when modelling with stochastic weather information or running the system in a mode that produces any kind of probabilistic output.

All data:

   Min.  1st Qu.   Median     Mean  3rd Qu.       Max.
   0.00    86.28  1920.72  4832.14  7003.20  206482.70

image

June:

   Min.  1st Qu.   Median     Mean  3rd Qu.       Max.
   0.00    42.06   949.31  4273.83  6263.56  132813.10

image

June at C3:

   Min.  1st Qu.   Median     Mean  3rd Qu.       Max.
   0.00     8.59   320.28  2273.88  2532.01   47268.11

image

@RobBryce
Collaborator Author

That's great work. Now we know. :)


4 participants