Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there some way to specify that certain interactions never occur? #19

Open
DWesl opened this issue Jul 20, 2020 · 10 comments
Open

Is there some way to specify that certain interactions never occur? #19

DWesl opened this issue Jul 20, 2020 · 10 comments

Comments

@DWesl
Copy link

DWesl commented Jul 20, 2020

>>> import numpy as np
>>> import numpy.linalg as la
>>> import pandas as pd
>>> import formulaic
>>> index_vals = tuple("abc")
>>> level_names = list("AB")
>>> n_samples = 2
>>> ds_simple = pd.DataFrame(
...     index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]), 
...     columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
... ).reset_index()
>>> simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple)[1]
>>> # Approximate the condition number of simple_X
>>> np.divide(*la.svd(simple_X)[1][[0, -1]])
13.9282...
>>> simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
>>> np.divide(*la.svd(simple_X)[1][[0, -1]])
5.06320...e+16

I would expect the condition numbers to be somewhat closer to each other.
Is this just a bad expectation?

In the case this is derived from, A == "a" sets something to zero, and then multiplies that by the stuff B controls.
Is the standard practice for this situation to duplicate data expected to be identical so the tests don't crash?

@DWesl
Copy link
Author

DWesl commented Jul 20, 2020

I found a better description of what's going on:

import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic
index_vals = tuple("abc")
level_names = list("AB")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print("Approximate condition number:",
          np.divide(*la.svd(design_matrix)[1][[0, -1]]))


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]), 
    columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
).reset_index()

print("All sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple)[1]
describe_design_matrix(simple_X)

print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
describe_design_matrix(simple_X)

print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns if not col.startswith("A[T.b]:")]]
describe_design_matrix(reduced_X)

produces

All sampled
Shape: (18, 9)
Rank:  9
Approximate condition number: 13.928203230275503
Only some sampled
Shape: (14, 9)
Rank:  7
Approximate condition number: 5.063203904729927e+16
Reduced X
Shape: (14, 7)
Rank:  7
Approximate condition number: 9.878656395922679

I think I can turn this into something I can use for the original problem.
Is there some way I can tell the rank-reducer that some combinations of factors will not exist?

@DWesl
Copy link
Author

DWesl commented Jul 20, 2020

Just realized I should probably mention:
This is formulaic 0.1.2 on CPython 3.6.10.

@DWesl DWesl changed the title Large condition number for incomplete sampling design Is there some way to specify that certain interactions never occur? Jul 21, 2020
@DWesl
Copy link
Author

DWesl commented Jul 21, 2020

A slightly more complicated case:

import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic
index_vals = tuple("abc")
level_names = list("ABC")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print("Approximate condition number:",
          np.divide(*la.svd(design_matrix)[1][[0, -1]]))


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]), 
    columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
).reset_index()

print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B + C) ** 3").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
describe_design_matrix(simple_X)

print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]]
describe_design_matrix(reduced_X)

Since formulaic appears to sort the column names before forming the interactions, a general solution would need the test to be similar to:

reduced_X = simple_X.loc[
    :, 
    [
        col for col in simple_X.columns 
        if (
            # Check for interaction terms involving A == 'b' ...
            (not col.startswith("A[T.b]:") and ":A[T.b]" not in col)
            # ... and any value of B
            or "B" not in col
        )
        # Equivalently,
        # if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and "B" in col
        # I may need to expand the test for B to avoid problems if, for example, A.endswith(B)
        # (":B[" in col or col.startswith("B[T."))
        # should work
    ]
]

I'm not entirely sure how to tell formulaic that A[T.a]:B[T.x] exists only for x = a, so it needs to move its A- definition from A[T.a]:B[T.{x != a}] (which doesn't exist) to A[T.b]:B[T.{x !=a}].

This looks like most of an implementation and a start at a test suite for this functionality if an API gets decided.
(There will be problems if someone decides to name their categorical variables "some_name" and ":some_name[T.a_val_in_some_name]:", but I'm not seeing a way around that with this approach.
I have no problem with avoiding weird names like this.)

@matthewwardrop
Copy link
Owner

matthewwardrop commented Jul 26, 2020

Hi @DWesl ! Sorry for the delay on this. I usually only work on Formulaic on the weekends, in and around family stuff.

This is a very interesting problem. The ensure_full_rank option currently only guarantees structural full-rankness. That is, it assumes that after categorical encoding, the data itself is full rank.

I'm a little reluctant to add functionality along these lines to the formula grammar itself, since I think that will complicate things a lot. However I do see two alternative paths to supporting something like this downstream (or even integrating it directly into Formulaic).

(1) As a post-model-matrix-generation transform that looks at the structure and patches it to be full rank.
(2) Implementing it as a transform.

For (2), it should be "trivial" to add this as a transform, but making it a stateful transform (something that can be reused on a different dataset) would be more tricky. I'm imagining a new transform that look something like:

model_matrix('~ rank_reduced_product( C(A), C(B), ..., options=...)') 

Obviously as the number of interactions grow, this looks more and more ugly.

What are your thoughts?

@matthewwardrop
Copy link
Owner

Oh... And you could explicitly write out the interactions:

model_matrix("{C(A)['a']}:{C(B)['b']}",...)

This syntax should work in the latest master, but with some tweaks to the parser perhaps we could enable doing:

model_matrix("A.a:B.b + ...")

If we did that, would that be sufficient for your use-case? You'd have to expand your products manually.

@DWesl
Copy link
Author

DWesl commented Jul 26, 2020

There's certainly the option of writing out the interactions explicitly, but I've worked around this for now by using:

reduced_X = simple_X.loc[
    :, 
    [
        col for col in simple_X.columns 
        if (
            # Check for interaction terms involving A == 'b' ...
            if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and
            # I need to expand the test for B to avoid problems if, for example, A.endswith(B)
            (":B[" in col or col.startswith("B[T."))
        )
    ]
]

(The problem that sparked this issue has the second column named essentially BA, which requires a bit more care than just B)

I would not add it to the grammar, I was thinking a keyword option to the Formula.get_model_matrix method, since this kind of thing very much depends on the exact way the data was collected.

Possibly a better (more consistent) way to do this would be to use the grammar to set B[T.b] as the baseline and removing the interaction terms which are always zero. This could be really easy to do in CSC sparse format (if the next column starts at the same place as the current column, there's no nonzero elements), and np.count_nonzero(..., axis=0) > 0 should give an index that could be used to drop columns in everything else.

... that's not currently implemented in formulaic, but the idea should be sound.

I have no idea which of these better conforms to the expectations of people looking at linear regression coefficients or ANOVA tables.

@matthewwardrop
Copy link
Owner

Sorry for the delay here again @DWesl ! I've had a lot on my plate the last few weeks.

I think I am persuaded that adding a keyword argument along the lines of "drop_columns_with_no_support" or "drop_all_zero_columns" could be useful, and that having be part of the model matrix generation process is valuable because that way new data will be forced to respect the limited interactions even if the new data has support for these interactions.

Would that meet your use-case?

@DWesl
Copy link
Author

DWesl commented Aug 30, 2020

That gets most of the way there; I'd need to reorder the variables in the Categorical to get a column with all zeros, but that's not that much effort.

@DWesl
Copy link
Author

DWesl commented Aug 30, 2020

I asked about this in pydata/patsy#161, and puzzled out what formula would be needed to ensure the all-zero columns happen: (C(A, Treatment('b')) + B + C) ** 3. I don't think formulaic supports changing the reference level yet, but if it respects the order of the levels in a pandas.Categorical dtype, I should be able to achieve the same effect by changing the order there.

@matthewwardrop
Copy link
Owner

Hi @DWesl !

As I prepare formulaic for the v1.0 release, I wanted to go back through all the old issues and make sure things were dealt with properly. I know it has been a really long time (sorry). Formulaic has come a long way during this time, but I don't think this specific issue has been "solved".

Formulaic does now support customizing the reference level in contrasts/etc, as well as explicitly ordering the levels and respecting the order of factors in the formula, so you could use what you wrote above... but, there's still room for improvement.

Has your thinking in this space evolved?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants