Earth System Data Science (ESDS)

A repository of examples that apply statistical and machine learning algorithms (mostly in R) to hydropedology.

Why data science?

What is data science?

Tools for data science

I'll largely be focused on using R.

About the data

Hydrology data

A combination of USGS stream discharge, landscape, and climate data.

Stream data -- National Water Information System

https://help.waterdata.usgs.gov/

https://owi.usgs.gov/R/dataRetrieval.html

Landscape data -- GAGES-II

https://www.sciencebase.gov/catalog/item/59692a64e4b0d1f9f05fbd39

Climate data -- PRISM

http://www.prism.oregonstate.edu/

Keep it simple, so focus on:

  • Precipitation
  • Mean temperature
  • Dew point temperature? Use this to get at relative humidity?
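On the dew point question: relative humidity can be estimated from air temperature and dew point temperature with the Magnus approximation for saturation vapor pressure. The sketch below is one way to do it, not a settled method for this project. The function names and the example values are hypothetical (they are not PRISM data), and the coefficients are the commonly cited Alduchov and Eskridge (1996) set.

```r
# Saturation vapor pressure (hPa) from temperature (deg C),
# using the Magnus approximation (Alduchov & Eskridge 1996 coefficients).
sat_vapor_pressure <- function(temp_c) {
  6.1094 * exp(17.625 * temp_c / (temp_c + 243.04))
}

# Relative humidity (%) is actual vapor pressure over saturation vapor
# pressure. Actual vapor pressure equals the saturation vapor pressure
# evaluated at the dew point.
relative_humidity <- function(temp_c, dewpoint_c) {
  100 * sat_vapor_pressure(dewpoint_c) / sat_vapor_pressure(temp_c)
}

# Hypothetical example values (deg C), not real PRISM data
tmean  <- c(20, 10, 0)
tdmean <- c(20, 5, -5)
rh <- relative_humidity(tmean, tdmean)
print(round(rh, 1))
```

Because the relationship is nonlinear, applying it to monthly or annual means gives only an approximation of the mean relative humidity over the same period.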

Climate data -- NRCS SNOTEL

  • How can I automate the download of these data?
  • Could I use these data to optimize a phase curve via logistic regression?
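The phase-curve idea can be sketched with base R's `glm()`. Everything in the example is an assumption for illustration: the data are simulated rather than taken from SNOTEL, the column names are hypothetical, and the "true" 50% rain-snow threshold of 1.5 °C is made up so that the fit has something to recover.

```r
set.seed(42)

# Simulated observations: air temperature (deg C) and precipitation phase
n <- 500
obs <- data.frame(air_temp = runif(n, -10, 10))
# Hypothetical "true" phase curve: 50% chance of snow at 1.5 deg C
p_snow <- plogis(-1.2 * (obs$air_temp - 1.5))
obs$is_snow <- rbinom(n, size = 1, prob = p_snow)

# Logistic regression: probability of snow as a function of air temperature
fit <- glm(is_snow ~ air_temp, data = obs, family = binomial)
b <- coef(fit)

# Temperature at which snow and rain are equally likely (p = 0.5)
t50 <- -b[["(Intercept)"]] / b[["air_temp"]]

# Predicted snow probability at a few temperatures
new_temps <- data.frame(air_temp = c(-5, 0, 5))
p_hat <- predict(fit, newdata = new_temps, type = "response")
print(t50)
print(round(p_hat, 3))
```

With real station data, other predictors (for example humidity or elevation) could be added to the formula in the same way.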

Soils data

Focus: ISRIC soils information (https://www.isric.org/)

Data availability: ISRIC Soil Data Hub (https://data.isric.org)

Data manipulation

Questions of interest

  • Is there a relationship between soil attributes and climate (Köppen-Geiger)?
    • We know this from the five soil-forming factors, but can we quantify the relationship?
  • Can I tell which continent a soil came from?
  • What are the most important attributes defining a soil (relative to the data I have)? (PCA or NMDS question.)
  • Do different soil attributes influence one another? (SEM question)
  • Are mean annual temperature data from PRISM and actual station data different from each other?
    • Is there geographic bias in the errors or significant differences?
    • Pairwise t-tests or other comparisons (Mann-Whitney)
    • Download the data from CompBio
      • Frequentist vs Bayesian methods
  • Can we predict the phase of snow using air temperature and other environmental data?
    • This is a classification problem that could be addressed with logistic regression and SVM.
  • Are there significant trends in annual discharge over time?
    • Linear regression
    • Map out the slope of significant trends across the US.
      • Use leaflet with clickable links to view individual annual hydrographs, each marked with a colored trend line and with abnormal years highlighted using the empirical density function.
    • Include both Frequentist and Bayesian forms of the analysis.
  • Is there a relationship between annual discharge, temperature, snow, elevation, etc.?
    • Multiple linear regression
  • What role do different landscape features play in the above relationships?
    • Could use the GAGES-II data set for this
    • Hierarchical multiple linear regression
    • Frequentist and Bayesian
  • Are there "natural" groups of discharge sensitivity (represented by the steepness of the slope)?
    • Discriminant analysis
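As a starting point for the discharge-trend question, here is a minimal frequentist sketch using `lm()`. The series is synthetic (a made-up baseline, trend, and noise level), not USGS data, and the 5th and 95th percentile cutoffs for "abnormal" years are an arbitrary choice for illustration.

```r
set.seed(1)

# Synthetic annual discharge: baseline 100, trend of -0.8 units per year,
# plus random year-to-year noise (all values are made up)
years <- 1980:2019
flow <- data.frame(
  year = years,
  discharge = 100 - 0.8 * (years - 1980) + rnorm(length(years), sd = 8)
)

# Linear trend in annual discharge
fit <- lm(discharge ~ year, data = flow)
slope <- coef(fit)[["year"]]
p_value <- summary(fit)$coefficients["year", "Pr(>|t|)"]

# Flag abnormal years: residuals outside the 5th to 95th percentile
# range of the empirical residual distribution
res <- residuals(fit)
cutoffs <- quantile(res, probs = c(0.05, 0.95))
flow$abnormal <- res < cutoffs[[1]] | res > cutoffs[[2]]

print(c(slope = slope, p_value = p_value))
print(flow$year[flow$abnormal])
```

Repeating this per gauge would give one slope and p-value per site, which is the input a map of significant trends would need.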

Algorithms to investigate

How should I organize these algorithms? By output data type? (This will help me figure out how to organize the site.)

  • Data types

    • Categorical
      • Nominal (Categories with no obvious relationship)
      • Ordinal (Categories in which order does matter)
    • Numerical
      • Interval (Numeric data in which equal differences are meaningful but zero is arbitrary -- e.g., -5, 0, 5, 10)
      • Ratio (Numeric data with a true zero, so ratios are meaningful)
  • Further attributes to consider

    • Data output type
    • Data input type
    • Parameter type
      • Single
      • Multiple
        • Mixed (categorical and numerical)
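The data types above map onto R's own types fairly directly. The short sketch below shows one way to represent each; the variable names and values are hypothetical and chosen only to illustrate the distinction.

```r
# Nominal: categories with no inherent order
continent <- factor(c("Africa", "Asia", "Europe", "Asia"))

# Ordinal: categories where order matters
drainage <- factor(c("poor", "well", "moderate", "poor"),
                   levels = c("poor", "moderate", "well"),
                   ordered = TRUE)

# Interval: equal steps are meaningful, but zero is arbitrary (deg C)
temp_c <- c(-5, 0, 5, 10)

# Ratio: true zero, so ratios are meaningful (mm of precipitation)
precip_mm <- c(0, 12.5, 25, 50)

print(is.ordered(continent))       # unordered factor
print(is.ordered(drainage))        # ordered factor
print(drainage < "well")           # order comparisons work on ordered factors
print(diff(temp_c))                # equal differences between values
print(precip_mm[4] / precip_mm[3]) # "twice as much" is a valid statement
```

Many modeling functions in R treat unordered and ordered factors differently (for example in how contrasts are coded), so the choice affects more than bookkeeping.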

The algorithms

Uncertainty

Other topics that don't fit neatly into the space above.

  • Leave-one-out cross-validation
  • k-fold cross-validation
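Both forms of cross-validation can be sketched with one base R function, since leave-one-out is k-fold with k equal to the number of observations. The data here are simulated, and `cv_rmse()` is a hypothetical helper written for this example, not a function from any package.

```r
set.seed(123)

# Simulated data: a simple linear relationship with unit-variance noise
n <- 60
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n)

# k-fold cross-validated RMSE for the model y ~ x.
# Setting k = nrow(data) gives leave-one-out cross-validation.
cv_rmse <- function(data, k) {
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test_idx <- which(folds == i)
    # Fit on all rows outside the fold, predict the held-out rows
    fit <- lm(y ~ x, data = data[-test_idx, , drop = FALSE])
    pred <- predict(fit, newdata = data[test_idx, , drop = FALSE])
    sq_err[test_idx] <- (data$y[test_idx] - pred)^2
  }
  sqrt(mean(sq_err))
}

rmse_kfold <- cv_rmse(dat, k = 5)
rmse_loocv <- cv_rmse(dat, k = nrow(dat))
print(c(kfold = rmse_kfold, loocv = rmse_loocv))
```

The k-fold estimate depends on the random fold assignment, so it changes with the seed; the leave-one-out estimate does not, because every observation is held out exactly once on its own.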

More about machine learning in R

Statistical learning resources