Demographic table is sometimes blank in non-debug environments #76
@achasmita which dataset is this against? I think I shared a stage dataset snapshot with you last week. |
Yes I am using the stage dataset snapshot. |
And what do you see on the staging site? Again, it would be good to outline all the steps of what you did to reproduce the problem and how well they worked. |
Is the stage dataset snapshot you shared the one we are using in staging? I cannot see the UUID table on my side, and the other table has fewer rows/pages compared to the staging environment. |
@achasmita the dataset for which you couldn't see the UUID table was the durham dataset. I also shared the stage dataset with you separately when we were debugging the previous issues (file stage-snapshot-test-dashboard.tar.gz). Note also that even in the dev environment, 9 of the rows are blank. Maybe you can start by debugging why that is happening? |
I would like some confirmation beyond the DB string that this is in fact the same dataset.
|
I will try to remove the volume and reload the data again. |
It is still the same. |
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aa33f3331ecb e-mission/opdash:0.0.1 "/bin/bash /usr/src/…" 2 hours ago Up 2 hours 0.0.0.0:8050->8050/tcp, 8080/tcp op-admin-dashboard-dashboard-1
f67b21700790 mongo:4.4.0 "docker-entrypoint.s…" 6 days ago Up 2 hours 0.0.0.0:27017->27017/tcp op-admin-dashboard-db-1
$ docker exec -it op-admin-dashboard-dashboard-1 /bin/bash
root@aa33f3331ecb:/usr/src/app# source setup/activate.sh
(emission) root@aa33f3331ecb:/usr/src/app# ./e-mission-py.bash
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32)
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import emission.core.get_database as edb
Connecting to database URL mongodb://db/openpath_stage
>>> import pandas as pd
>>> all_survey_entries = list(edb.get_timeseries_db().find({"metadata.key": "manual/demographic_survey"}))
>>> all_survey_entries[0]["_id"]
ObjectId('62a20b7caa2dde2114165cb0')
>>> all_survey_entries[1]["_id"]
ObjectId('62a23791aa2dde660933e539')
>>> all_survey_entries[2]["_id"]
ObjectId('62a3df3faa2dde2114166d4b')
>>> all_survey_entries[3]["_id"]
ObjectId('62a3df3daa2dde2114166b25')
>>> all_survey_entries[4]["_id"]
ObjectId('62a520aeaa2dde21141678c6')
>>> all_survey_entries[5]["_id"]
ObjectId('62a523beaa2dde21141683d5')
>>> all_survey_entries[6]["_id"]
ObjectId('62a52430aa2dde21141692c7')
>>> all_survey_entries[7]["_id"]
ObjectId('62aba8deaa2dde64df7cef73')
>>> all_survey_entries[8]["_id"]
ObjectId('62abd399aa2dde64df7d076c')
>>> all_survey_entries[9]["_id"]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
|
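One way to confirm beyond the DB connection string that the two environments hold the same entries is to diff the `_id` sets from each. A minimal sketch: the dev ids below are a few of the ones printed in the session above, and the staging set is hypothetical, standing in for the output of the same query run against the staging database.

```python
# Sketch: compare survey-entry ObjectId strings between environments.
# dev_ids: a few of the ids printed in the dev session above.
# stage_ids: hypothetical, would come from running the same
# find({"metadata.key": "manual/demographic_survey"}) query on staging.
dev_ids = {
    "62a20b7caa2dde2114165cb0",
    "62a23791aa2dde660933e539",
    "62a3df3faa2dde2114166d4b",
}
stage_ids = dev_ids | {"62aba8deaa2dde64df7cef73"}  # hypothetical extra entry

# Entries present in staging but missing on dev point at an incomplete load
missing_on_dev = stage_ids - dev_ids
print(sorted(missing_on_dev))
```

If the difference is non-empty, the dev load is incomplete rather than the datasets being genuinely different.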
Checking columns in both environments:
Extra columns in staging environment: |
I checked the individual data for the matching columns in both environments, and they are all the same. Both tables have matching data. |
What is "both environment"? Are you saying that openpath-stage.nrel.gov and your local dev environment have the same data? Because I would like to see proof if that is the case. |
@achasmita so what do you plan to do next? |
Added some log statements in the code:
logging.debug("After modifying, df columns are %s" % df.columns)
logging.debug("The data in column is %s" % df.head(5))
logging.debug("The data in column is %s" % df["Which_one_below_describe_you_b"]) |
DEBUG:root:After modifying, df columns are Index(['_id', 'user_id', 'How_old_are_you', 'What_is_your_gender',
'Do_you_have_a_driver_license', 'Are_you_a_student',
'What_is_the_highest_grade_or_d', 'Do_you_work_for_either_pay_or_',
'Which_one_below_describe_you_b', 'Do_you_own_or_rent_your_home',
'What_is_your_home_type', 'Please_identify_which_category',
'Including_yourself_how_many_p', 'Including_yourself_how_many_w', 'Including_yourself_how_many_c', 'Including_yourself_how_many_p_001',
'How_many_vehicles_are_owned_l', 'If_you_were_unable_to_use_your',
'Do_you_have_a_condition_or_han', 'How_long_you_had_this_conditio',
'Do_you_have_more_than_one_job', 'Do_you_work_full_time_or_part_',
'Which_best_describes_your_prim', 'At_your_primary_job_do_you_ha',
'Do_you_have_the_option_of_work', 'How_many_days_do_you_usually_w_001',
'What_days_of_the_week_do_you_t', 'How_did_you_usually_get_to_you',
'What_is_your_typical_access_eg', 'data.ts', 'data.fmt_time',
'data.local_dt.year', 'data.local_dt.month', 'data.local_dt.day',
'data.local_dt.hour', 'data.local_dt.minute', 'data.local_dt.second',
'data.local_dt.weekday', 'data.local_dt.timezone'],
dtype='object')
DEBUG:root:finished querying values for ['manual/demographic_survey'], count = 9
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:orig_ts_db_matches = 9, analysis_ts_db_matches = 0 INFO:werkzeug:172.18.0.1 - - [20/Oct/2023 21:19:40] "POST /_dash-update-component HTTP/1.1" 200 -
DEBUG:root:The data in column is _id ... data.local_dt.timezone
0 62a20b7caa2dde2114165cb0 ... America/Los_Angeles
1 62a23791aa2dde660933e539 ... America/Los_Angeles
2 62a3df3faa2dde2114166d4b ... America/Los_Angeles
3 62a3df3daa2dde2114166b25 ... America/Los_Angeles
4 62a520aeaa2dde21141678c6 ... America/Los_Angeles
[5 rows x 39 columns]
DEBUG:root:The data in column is 0
1 i_m_not_able_to_work__due_to_reasons_lik
2
3
4 i_m_temporarily_absent_from_a_job_now
5
6
7
8
Name: Which_one_below_describe_you_b, dtype: object |
@achasmita just to confirm, you are still seeing the surveys with blank columns on staging, correct? It is just that those entries are not in your local dev version? That seems a bit weird because we know, from the public dashboard, that there should be 59 users. Are you seeing 59 entries on dev? |
Yes, I can see those blank columns on staging, and those entries are not in the local dev version; it only includes 9 entries in local dev, with data.fmt_time between 2022-06-09 and 2022-06-15. |
@achasmita where is the PR with the logs? |
oops, I committed inside the previous PR #82 |
so will there be a new PR with only this change? We don't want to make too many new changes until we have a stable version of the existing ones |
created new PR #83 |
@achasmita I also downloaded a new snapshot of the staging data and shared it with you. I would also like to understand the fundamental reason why you see only 9 users while the staging database has > 50 |
I reloaded the data and am getting the same 9 rows. |
@achasmita I don't think you are loading the data correctly in that case. Concretely, the On dev, it ends in |
I am using the following script to load the data:
MONGODUMP_FILE=$1
echo "Copying file to docker container"
docker cp $MONGODUMP_FILE op-admin-dashboard-db-1:/tmp
FILE_NAME=`basename $MONGODUMP_FILE`
echo "Restoring the dump from $FILE_NAME"
docker exec -e MONGODUMP_FILE=$FILE_NAME op-admin-dashboard-db-1 bash -c 'cd /tmp && tar xvf $MONGODUMP_FILE && mongorestore' |
There are two ids containing 'ec6' in both, where one ends with 'cb0' and the other with '76c'. |
I have added these log statements so that we can check the logs on staging. |
Finally!! I am able to see 111 rows. I was having trouble loading the data before, but after increasing the memory in Docker Desktop it worked. Now I can see those blank rows from staging. |
While running, I saw this warning: /usr/src/app/app_sidebar_collapsible.py:167: UserWarning:
DataFrame columns are not unique, some columns will be omitted. |
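That warning is pandas' behavior when a DataFrame with duplicate column labels is converted to records (which is typically how a Dash table gets its data): the duplicated labels are silently collapsed and some values are dropped. A minimal reproduction, with hypothetical column names:

```python
import warnings
import pandas as pd

# Two columns share the label "q1", mimicking the collapsed survey columns
df = pd.DataFrame([[1, 2, 3]], columns=["q1", "q2", "q1"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    records = df.to_dict("records")

# pandas warns that columns are not unique and some will be omitted;
# only one "q1" key survives in the record, so one value is lost
print(caught[0].message)
print(records)
```

This is consistent with the blank-cell symptom: whichever duplicate survives may hold an empty value while the dropped duplicate held the real answer.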
Investigating columns in the demographic table:
1 _id
2 user_id
3 How_old_are_you
4 What_is_your_gender
5 Do_you_have_a_driver_license
6 Are_you_a_student
7 What_is_the_highest_grade_or_d
8 Do_you_work_for_either_pay_or_
9 Which_one_below_describe_you_b
10 Do_you_own_or_rent_your_home
11 What_is_your_home_type
12 Please_identify_which_category
13 Including_yourself_how_many_p
14 Including_yourself_how_many_w
15 Including_yourself_how_many_c
16 Including_yourself_how_many_p_001
17 How_many_vehicles_are_owned_l
18 If_you_were_unable_to_use_your
19 Do_you_have_a_condition_or_han
20 How_long_you_had_this_conditio
21 Do_you_have_more_than_one_job
22 Do_you_work_full_time_or_part_
23 Which_best_describes_your_prim
24 At_your_primary_job_do_you_ha
25 Do_you_have_the_option_of_work
26 How_many_days_do_you_usually_w_001
27 What_days_of_the_week_do_you_t
28 How_did_you_usually_get_to_you
29 What_is_your_typical_access_eg
30 data.ts
31 data.fmt_time
32 data.local_dt.year
33 data.local_dt.month
34 data.local_dt.day
35 data.local_dt.hour
36 data.local_dt.minute
37 data.local_dt.second
38 data.local_dt.weekday
39 data.local_dt.timezone
40 How_old_are_you
41 What_is_your_gender
42 do_you_consider_yourself_to_be
43 What_is_your_race_ethnicity
44 Do_you_have_a_driver_license
45 Are_you_a_student
46 What_is_the_highest_grade_or_d
47 Are_you_a_paid_worker
48 Which_one_below_describe_you_b
49 Do_you_own_or_rent_your_home
50 What_is_your_home_type
51 Please_identify_which_category
52 Including_yourself_how_many_p
53 Including_yourself_how_many_w
54 Including_yourself_how_many_w_001
55 Including_yourself_how_many_p_001
56 How_many_motor_vehicles_are_ow
57 If_you_were_unable_to_use_your
58 Do_you_have_a_condition_or_han
59 How_long_you_had_this_conditio
60 Do_you_have_more_than_one_job
61 Do_you_work_full_time_or_part_
62 Which_best_describes_your_prim
63 Please_describe_your_primary_job
64 At_your_primary_job_do_you_ha
65 Do_you_have_the_option_of_work
66 How_many_days_do_you_usually_w_001
67 What_days_of_the_week_do_you_t
68 How_old_are_you
69 What_is_your_gender
70 do_you_consider_yourself_to_be
71 What_is_your_race_ethnicity
72 Do_you_have_a_driver_license
73 Are_you_a_student
74 What_is_the_highest_grade_or_d
75 Are_you_a_paid_worker
76 Which_one_below_describe_you_b
77 Do_you_own_or_rent_your_home
78 What_is_your_home_type
79 Please_identify_which_category
80 Including_yourself_how_many_p
81 Including_yourself_how_many_w
82 Including_yourself_how_many_w_001
83 Including_yourself_how_many_p_001
84 How_many_motor_vehicles_are_ow
85 If_you_were_unable_to_use_your
86 Do_you_have_a_condition_or_han
87 How_long_you_had_this_conditio
88 Do_you_have_more_than_one_job
89 Do_you_work_full_time_or_part_
90 Which_best_describes_your_prim
91 Please_describe_your_primary_job
92 At_your_primary_job_do_you_ha
93 Do_you_have_the_option_of_work
94 How_many_days_do_you_usually_w_001
95 What_days_of_the_week_do_you_t
96 At_your_primary_job_do_you_ha
97 Which_best_describes_your_prim
98 Do_you_work_full_time_or_part_
99 Do_you_have_the_option_of_work
100 Please_describe_your_primary_job
101 Do_you_have_more_than_one_job
102 What_days_of_the_week_do_you_t
103 How_many_days_do_you_usually_w_001
104 Which_one_below_describe_you_b
105 What_is_your_race_ethnicity
106 Are_you_a_student
107 What_is_the_highest_grade_or_d
108 do_you_consider_yourself_to_be
109 What_is_your_gender
110 How_old_are_you
111 Are_you_a_paid_worker
112 Do_you_have_a_driver_license
113 How_long_you_had_this_conditio
114 Including_yourself_how_many_w_001
115 Including_yourself_how_many_p
116 Do_you_own_or_rent_your_home
117 Please_identify_which_category
118 How_many_motor_vehicles_are_ow_001
119 Including_yourself_how_many_p_001
120 If_you_were_unable_to_use_your
121 Including_yourself_how_many_w
122 What_is_your_home_type
123 How_many_motor_vehicles_are_ow
124 Do_you_have_a_condition_or_han |
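Duplicate labels in a list like the one above can be tallied with a Counter. A minimal sketch; the sample names below stand in for the full 124-column list:

```python
from collections import Counter

# Sample of simplified column names; the real list above has 124 entries
# after the data.jsonDocResponse prefixes are stripped.
columns = [
    "How_old_are_you", "What_is_your_gender", "How_old_are_you",
    "What_is_your_race_ethnicity", "How_old_are_you", "What_is_your_gender",
]

# Keep only names that appear more than once, with their counts
duplicates = {name: n for name, n in Counter(columns).items() if n > 1}
print(duplicates)  # {'How_old_are_you': 3, 'What_is_your_gender': 2}
```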
And most of these columns are repeated 3 or 4 times:
1 If_you_were_unable_to_use_your - 4
2 How_many_motor_vehicles_are_ow - 4
3 What_days_of_the_week_do_you_t - 4
4 What_is_the_highest_grade_or_d - 4
5 How_old_are_you - 4
6 Including_yourself_how_many_w - 4
7 Do_you_work_full_time_or_part_ - 4
8 What_is_your_gender - 4
9 Do_you_have_the_option_of_work - 4
10 do_you_consider_yourself_to_be - 3
11 How_long_you_had_this_conditio - 4
12 Are_you_a_student - 4
13 Do_you_have_more_than_one_job - 4
14 Do_you_have_a_condition_or_han - 4
15 Including_yourself_how_many_p_001 - 4
16 Do_you_own_or_rent_your_home - 4
17 What_is_your_home_type - 4
18 How_many_days_do_you_usually_w_001 - 4
19 Please_describe_your_primary_job - 3
20 Do_you_have_a_driver_license - 4
21 Which_one_below_describe_you_b - 4
22 Please_identify_which_category - 4
23 At_your_primary_job_do_you_ha - 4
24 Which_best_describes_your_prim - 4
25 Including_yourself_how_many_p - 4
26 Are_you_a_paid_worker - 3
27 Including_yourself_how_many_w_001 - 3
28 What_is_your_race_ethnicity - 3
After splitting the column names to display only the question part, they lose their unique identity. For example:
'data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity'
all become duplicate columns. I think that is why we are getting this warning:
/usr/src/app/app_sidebar_collapsible.py:167: UserWarning:
DataFrame columns are not unique, some columns will be omitted.
And when these duplicate columns are omitted, we lose the data in them. |
'data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity'
For now, I am trying to change the third key of each path into the same key so that these columns merge. |
I am not sure if that is a great idea. What is the problem you are trying to solve, and why is this the correct approach for solving it? Again, our goal is not to have a bunch of hacks to "get things to work", our goal is to have a principled implementation |
On my dev environment I can see many blank columns even when they are not empty, and while running the code I saw the warning:
/usr/src/app/app_sidebar_collapsible.py:167: UserWarning:
DataFrame columns are not unique, some columns will be omitted.
When I looked into the code, we are doing:
df.columns = [col.rsplit('.', 1)[-1] if col.startswith('data.jsonDocResponse.') else col for col in df.columns]
to simplify the column names. But by doing this, we get three different columns with the same name. Therefore, if we can make all of those column names identical, we can extract all of that data into the same column and there will be no conflict of duplicate columns. |
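The renaming step quoted above can be seen collapsing the three distinct paths into a single label:

```python
# The column-simplification step from app_sidebar_collapsible.py, applied
# to the three full paths quoted above.
cols = [
    "data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity",
]
simplified = [c.rsplit(".", 1)[-1] if c.startswith("data.jsonDocResponse.") else c
              for c in cols]
print(simplified)  # three identical 'What_is_your_race_ethnicity' labels
```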
That is the problem. Why is this (aka "For now I am trying to change the third key of the path (aSfdnWs9LE6q8YEF7u9n85, data, aQhPrHNVZA6L2cxBaMeE9Y) into the same key (e.g. survey_id) so that they all look similar and we get a single column") the solution?
|
Yes, either we can show all of these columns separately, or we can combine them. Combining them helps to solve this problem, and the option I see is to make the names identical. Also, I am still exploring other options. |
Problem: duplicate column names in df, leading to data loss and blank cells in the table.
Pros:
Solution 2: include a unique identity in the column name while simplifying the column names in df.
Pros:
Cons: |
Is there actually data loss? A: Yes, because if there are three similar columns, we only display data from one.
But it is not clear that the survey question is the "same", given that it actually comes from multiple surveys. So IMHO, instead of "hacking" this by replacing the survey id and forcing the results to be in the same format, we should support having multiple versions of the survey over time. I would encourage you to look up examples of custom surveys. |
Another option can be: |
Bingo! That was what I had in mind as well. Program admins can then download the tables for each of their survey versions independently. And it doesn't matter if the questions are the same or different. |
Ok I will start working on that. |
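The per-survey-version approach discussed above could be sketched as: group the raw columns by the jsonDocResponse key (assumed here to identify a survey version) and build one table per group. The helper name and grouping rule below are illustrative, not the project's actual API:

```python
from collections import defaultdict

def group_columns_by_survey_version(columns):
    """Split raw timeseries columns into per-survey-version groups, keyed by
    the path segment after 'data.jsonDocResponse.' (assumed to be the survey
    version id). Columns without that prefix go into a shared 'common' group."""
    groups = defaultdict(list)
    for col in columns:
        if col.startswith("data.jsonDocResponse."):
            version = col.split(".")[2]  # key right after jsonDocResponse
            groups[version].append(col)
        else:
            groups["common"].append(col)
    return dict(groups)

cols = [
    "user_id",
    "data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity",
]
print(group_columns_by_survey_version(cols))
```

Each group can then be rendered (and downloaded) as its own table, so question names only need to be unique within one survey version.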
Spot-checking a few non-debug environments, I see several blank rows.
This is particularly egregious in staging, where the entire first page consists of blank rows except for one row with "prefer not to say". However, I also see it in Smart Commute, where the first few rows are blank,
and in the harvard dataset:
e-mission/e-mission-docs#1000 (comment)
I have downloaded the csv and confirmed that the values are empty.
We should see if this is reproducible in a dev environment, and investigate further