Demographic table is sometimes blank in non-debug environments #76

Open
shankari opened this issue Oct 6, 2023 · 48 comments

@shankari
Contributor

shankari commented Oct 6, 2023

Spot checking a few non-debug environments, I see several blank rows.

This is particularly egregious in staging, where the entire first page consists of blank rows except for one row with "prefer not to say". However, I also see it in Smart Commute where the first few rows are blank

Screenshot 2023-10-06 at 3 24 50 PM

or in the harvard dataset
e-mission/e-mission-docs#1000 (comment)

I have downloaded the csv and confirmed that the values are empty.

We should see if this is reproducible in a dev environment, and investigate further

@achasmita
Contributor

achasmita commented Oct 9, 2023

I only have this dataset for the demographic table.

Screen Shot 2023-10-09 at 4 38 46 PM

@shankari
Contributor Author

@achasmita which dataset is this against? I think I shared a stage dataset snapshot with you last week.
Are you trying to run against it?

@achasmita
Contributor

@achasmita which dataset is this against? I think I shared a stage dataset snapshot with you last week. Are you trying to run against it?

Yes I am using the stage dataset snapshot.

@shankari
Contributor Author

And what do you see on the staging site? Again, it would be good to outline all the steps of what you did to reproduce the problem and how well they worked.

@achasmita
Contributor

I can see a lot of blank rows in the staging environment, while in the dev environment there are only 9 rows of this data, all of them complete.
Screen Shot 2023-10-10 at 7 25 29 AM
Screen Shot 2023-10-10 at 7 26 11 AM
Screen Shot 2023-10-10 at 7 26 27 AM

@achasmita
Contributor

Is the stage dataset snapshot you shared the one we are using in staging? I cannot see the UUID table on my side, and the other table also has fewer rows/pages than in the staging environment.

@shankari
Contributor Author

shankari commented Oct 11, 2023

@achasmita the dataset for which you couldn't see the UUID table was the durham dataset. I also shared the stage dataset with you separately when we were debugging the previous issues (file stage-snapshot-test-dashboard.tar.gz). Note also that even in the dev environment, 9 of the rows are blank. Maybe you can start by debugging why that is happening?

@achasmita
Contributor

I am using DB_HOST: "mongodb://db/openpath_stage". There is no UUID table, and the demographic table has only 9 rows, none of which are blank.
Screen Shot 2023-10-11 at 2 33 42 PM

@shankari
Contributor Author

I would like some confirmation beyond the DB string that this is in fact the same dataset.

  • can you confirm that you have restored this from the staging dataset stage-snapshot-test-dashboard by switching the volume or deleting entries?
  • are the UUIDs here matching the ones on staging?

@achasmita
Contributor

Yes, I removed the previous volume and used the stage snapshot. Also, the 9 rows in this table match the ones in the staging env.
Screen Shot 2023-10-11 at 2 49 28 PM

@achasmita
Contributor

I will try to remove the volume and reload the data again.

@achasmita
Contributor

It is still the same.

@achasmita
Contributor

$ docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED       STATUS       PORTS                              NAMES
aa33f3331ecb   e-mission/opdash:0.0.1   "/bin/bash /usr/src/…"   2 hours ago   Up 2 hours   0.0.0.0:8050->8050/tcp, 8080/tcp   op-admin-dashboard-dashboard-1
f67b21700790   mongo:4.4.0              "docker-entrypoint.s…"   6 days ago    Up 2 hours   0.0.0.0:27017->27017/tcp           op-admin-dashboard-db-1
$ docker exec -it op-admin-dashboard-dashboard-1 /bin/bash
root@aa33f3331ecb:/usr/src/app# source setup/activate.sh
(emission) root@aa33f3331ecb:/usr/src/app# ./e-mission-py.bash
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) 
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import emission.core.get_database as edb
Connecting to database URL mongodb://db/openpath_stage
>>> import pandas as pd
>>> all_survey_entries = list(edb.get_timeseries_db().find({"metadata.key": "manual/demographic_survey"}))
>>> all_survey_entries[0]["_id"]
ObjectId('62a20b7caa2dde2114165cb0')
>>> all_survey_entries[1]["_id"]
ObjectId('62a23791aa2dde660933e539')
>>> all_survey_entries[2]["_id"]
ObjectId('62a3df3faa2dde2114166d4b')
>>> all_survey_entries[3]["_id"]
ObjectId('62a3df3daa2dde2114166b25')
>>> all_survey_entries[4]["_id"]
ObjectId('62a520aeaa2dde21141678c6')
>>> all_survey_entries[5]["_id"]
ObjectId('62a523beaa2dde21141683d5')
>>> all_survey_entries[6]["_id"]
ObjectId('62a52430aa2dde21141692c7')
>>> all_survey_entries[7]["_id"]
ObjectId('62aba8deaa2dde64df7cef73')
>>> all_survey_entries[8]["_id"]
ObjectId('62abd399aa2dde64df7d076c')
>>> all_survey_entries[9]["_id"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> all_survey_entries[9]["_id"]

@achasmita
Contributor

Screen Shot 2023-10-18 at 1 22 09 PM

@achasmita
Contributor

Checking columns in both environments:

  1. '_id'
  2. 'user_id'
  3. 'How_old_are_you'
  4. 'What_is_your_gender'
  5. 'Do_you_have_a_driver_license'
  6. 'Are_you_a_student'
  7. 'What_is_the_highest_grade_or_d'
  8. 'Do_you_work_for_either_pay_or_'
  9. 'Which_one_below_describe_you_b'
  10. 'Do_you_own_or_rent_your_home'
  11. 'What_is_your_home_type'
  12. 'Please_identify_which_category'
  13. 'Including_yourself_how_many_p'
  14. 'Including_yourself_how_many_w'
  15. 'Including_yourself_how_many_c'
  16. 'Including_yourself_how_many_p_001'
  17. 'How_many_vehicles_are_owned_l'
  18. 'If_you_were_unable_to_use_your'
  19. 'Do_you_have_a_condition_or_han'
  20. 'How_long_you_had_this_conditio'
  21. 'Do_you_have_more_than_one_job'
  22. 'Do_you_work_full_time_or_part_'
  23. 'Which_best_describes_your_prim'
  24. 'At_your_primary_job_do_you_ha'
  25. 'Do_you_have_the_option_of_work'
  26. 'How_many_days_do_you_usually_w_001'
  27. 'What_days_of_the_week_do_you_t'
  28. 'How_did_you_usually_get_to_you'
  29. 'What_is_your_typical_access_eg'
  30. 'data.ts'
  31. 'data.fmt_time'
  32. 'data.local_dt.year'
  33. 'data.local_dt.month'
  34. 'data.local_dt.day'
  35. 'data.local_dt.hour'
  36. 'data.local_dt.minute'
  37. 'data.local_dt.second'
  38. 'data.local_dt.weekday'
  39. 'data.local_dt.timezone'

Extra columns in the staging environment:

  40. 'do_you_consider_yourself_to_be'
  41. 'What_is_your_race_ethnicity'
  42. 'Are_you_a_paid_worker'
  43. 'Including_yourself_how_many_w_001'
  44. 'How_many_vehicles_are_ow'
  45. 'Please_describe_your_primary_job'
  46. 'How_many_motor_vehicles_are_ow_001'
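
A comparison like the one above can be done mechanically rather than by eye. This is a minimal sketch, assuming the column names from each environment have been copied into plain Python lists; the function name `column_diff` and the shortened lists are illustrative and are not taken from the dashboard code.

```python
# Hypothetical, shortened column lists. In practice these would be
# list(df.columns) captured in each environment.
dev_columns = [
    "_id", "user_id", "How_old_are_you", "What_is_your_gender",
    "Do_you_work_for_either_pay_or_",
]
stage_columns = dev_columns + [
    "do_you_consider_yourself_to_be", "Are_you_a_paid_worker",
]


def column_diff(left, right):
    """Return (only_in_left, only_in_right), preserving the input order."""
    left_set, right_set = set(left), set(right)
    only_left = [c for c in left if c not in right_set]
    only_right = [c for c in right if c not in left_set]
    return only_left, only_right


only_dev, only_stage = column_diff(dev_columns, stage_columns)
print("only in dev:", only_dev)
print("only in staging:", only_stage)
```

Note that a set-based diff hides repeated names, so it would not reveal the case where one environment contains the same column name more than once.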

@achasmita
Contributor

I checked the individual data for the columns that appear in both environments, and they are all the same. Both tables have matching data.

@shankari
Contributor Author

I checked the individual data for the columns that appear in both environments, and they are all the same. Both tables have matching data.

What is "both environment"? Are you saying that openpath-stage.nrel.gov and your local dev environment have the same data? Because I would like to see proof if that is the case.

@achasmita
Contributor

In my local dev environment, I have 9 rows * 39 columns for the demographic table:
Screen Shot 2023-10-19 at 9 19 39 AM
Screen Shot 2023-10-19 at 9 20 05 AM
Screen Shot 2023-10-19 at 9 20 21 AM
Screen Shot 2023-10-19 at 9 20 50 AM
Screen Shot 2023-10-19 at 9 21 06 AM
Screen Shot 2023-10-19 at 9 21 23 AM
Screen Shot 2023-10-19 at 9 21 39 AM

Checking the data for the same user_id and _id on openpath-stage.nrel.gov:
The data is the same as in the local dev environment, plus some extra columns which are blank:
Screen Shot 2023-10-19 at 9 49 21 AM
Screen Shot 2023-10-19 at 9 49 52 AM
Screen Shot 2023-10-19 at 9 50 15 AM
Screen Shot 2023-10-19 at 9 51 14 AM
Screen Shot 2023-10-19 at 9 51 36 AM
Screen Shot 2023-10-19 at 9 51 55 AM
Screen Shot 2023-10-19 at 9 52 18 AM
Screen Shot 2023-10-19 at 9 53 21 AM
Screen Shot 2023-10-19 at 10 09 21 AM

@achasmita
Contributor

I checked all the rows I can see in the demographic table in the dev environment and compared them against the staging environment by _id, and they are similar. I also checked the datatypes of all columns and verified the data in every column (especially the columns with no data), and they are all correct.
Screen Shot 2023-10-20 at 1 36 29 PM
Screen Shot 2023-10-20 at 1 36 57 PM
Screen Shot 2023-10-20 at 1 37 38 PM

@shankari
Contributor Author

@achasmita so what do you plan to do next?

@achasmita
Contributor

Added some log statements in the code:

    logging.debug("After modifying, df columns are %s" % df.columns)
    logging.debug("The data in column is %s" %df.head(5))
    logging.debug("The data in column is %s" %df["Which_one_below_describe_you_b"])

@achasmita
Contributor

DEBUG:root:After modifying, df columns are Index(['_id', 'user_id', 'How_old_are_you', 'What_is_your_gender',
'Do_you_have_a_driver_license', 'Are_you_a_student',
'What_is_the_highest_grade_or_d', 'Do_you_work_for_either_pay_or_',
'Which_one_below_describe_you_b', 'Do_you_own_or_rent_your_home',
'What_is_your_home_type', 'Please_identify_which_category',
'Including_yourself_how_many_p', 'Including_yourself_how_many_w', 'Including_yourself_how_many_c', 'Including_yourself_how_many_p_001',
'How_many_vehicles_are_owned_l', 'If_you_were_unable_to_use_your',
'Do_you_have_a_condition_or_han', 'How_long_you_had_this_conditio',
'Do_you_have_more_than_one_job', 'Do_you_work_full_time_or_part_',
'Which_best_describes_your_prim', 'At_your_primary_job_do_you_ha',
'Do_you_have_the_option_of_work', 'How_many_days_do_you_usually_w_001',
'What_days_of_the_week_do_you_t', 'How_did_you_usually_get_to_you',
'What_is_your_typical_access_eg', 'data.ts', 'data.fmt_time',
'data.local_dt.year', 'data.local_dt.month', 'data.local_dt.day',
'data.local_dt.hour', 'data.local_dt.minute', 'data.local_dt.second',
'data.local_dt.weekday', 'data.local_dt.timezone'],
dtype='object')
DEBUG:root:finished querying values for ['manual/demographic_survey'], count = 9
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:orig_ts_db_matches = 9, analysis_ts_db_matches = 0
INFO:werkzeug:172.18.0.1 - - [20/Oct/2023 21:19:40] "POST /_dash-update-component HTTP/1.1" 200 -
DEBUG:root:The data in column is                         _id  ... data.local_dt.timezone
0  62a20b7caa2dde2114165cb0  ...    America/Los_Angeles
1  62a23791aa2dde660933e539  ...    America/Los_Angeles
2  62a3df3faa2dde2114166d4b  ...    America/Los_Angeles
3  62a3df3daa2dde2114166b25  ...    America/Los_Angeles
4  62a520aeaa2dde21141678c6  ...    America/Los_Angeles

 [5 rows x 39 columns]

DEBUG:root:The data in column is 0                                            
1    i_m_not_able_to_work__due_to_reasons_lik
2                                            
3                                            
4       i_m_temporarily_absent_from_a_job_now
5                                            
6                                            
7                                            
8                                            
Name: Which_one_below_describe_you_b, dtype: object

@shankari
Contributor Author

@achasmita just to confirm, you are still seeing the surveys with blank columns on staging, correct? It is just that those entries are not in your local dev version? That seems a bit weird because we know, from the public dashboard, that there should be 59 users
https://openpath-stage.nrel.gov/public/

Are you seeing 59 entries on dev?

@achasmita
Contributor

achasmita commented Oct 20, 2023

Yes, I can see those blank columns on staging, and those entries are not in my local dev version. Local dev includes only 9 entries, with data.fmt_time between 2022-06-09 and 2022-06-15.

@shankari
Contributor Author

@achasmita where is the PR with the logs?

@achasmita
Contributor

achasmita commented Oct 20, 2023

Oops, I committed it inside the previous PR #82

@shankari
Contributor Author

so will there be a new PR with only this change? We don't want to make too many new changes until we have a stable version of the existing ones

@achasmita
Contributor

Created a new PR, #83

@shankari
Contributor Author

@achasmita I also downloaded a new snapshot of the staging data and shared it with you. I would also like to understand the fundamental reason why you see only 9 users while the staging database has > 50

@achasmita
Contributor

I reloaded the data and am still getting the same 9 rows.

@shankari
Contributor Author

@achasmita I don't think you are loading the data correctly in that case.
As I have said multiple times before, you need to be able to see > 50 users.
Please document how you are loading data and whether you are seeing any errors.

Concretely, the _id field of the UUID ending with ec6 (the first row), is different in the two screenshots from
#76 (comment)

On dev, it ends in cb0, on stage, it ends in 76c

@achasmita
Contributor

I am using the following script to load data.

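# Path to the mongodump tarball on the host
MONGODUMP_FILE=$1

echo "Copying file to docker container"
docker cp "$MONGODUMP_FILE" op-admin-dashboard-db-1:/tmp

FILE_NAME=$(basename "$MONGODUMP_FILE")

echo "Restoring the dump from $FILE_NAME"
# The single-quoted command is expanded inside the container, where
# MONGODUMP_FILE is set to the bare file name via -e
docker exec -e MONGODUMP_FILE="$FILE_NAME" op-admin-dashboard-db-1 bash -c 'cd /tmp && tar xvf "$MONGODUMP_FILE" && mongorestore'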

@achasmita
Contributor

@achasmita I don't think you are loading the data correctly in that case. As I have said multiple times before, you need to be able to see > 50 users. Please document how you are loading data and whether you are seeing any errors.

Concretely, the _id field of the UUID ending with ec6 (the first row), is different in the two screenshots from #76 (comment)

On dev, it ends in cb0, on stage, it ends in 76c

There are two entries for the UUID ending in 'ec6' in both environments: the _id of one ends in 'cb0' and the other in '76c'.

@achasmita
Contributor

Added some log statements in the code:

    logging.debug("After modifying, df columns are %s" % df.columns)
    logging.debug("The data in column is %s" %df.head(5))
    logging.debug("The data in column is %s" %df["Which_one_below_describe_you_b"])

I have added these log statements so that we can check the logs on staging.

@achasmita
Contributor

Finally!! I am able to see 111 rows. I was having trouble loading the data before, but after fixing the memory setting in Docker Desktop it worked. Now I can see those blank rows in staging.

@achasmita
Contributor

While running, I saw this warning:

/usr/src/app/app_sidebar_collapsible.py:167: UserWarning:

DataFrame columns are not unique, some columns will be omitted.

@achasmita
Contributor

Investigating columns in the demographic table:
After modifying and removing unnecessary columns, there are 124 columns in the demographic table:
[111 rows x 124 columns]

1 _id
2 user_id
3 How_old_are_you
4 What_is_your_gender
5 Do_you_have_a_driver_license
6 Are_you_a_student
7 What_is_the_highest_grade_or_d
8 Do_you_work_for_either_pay_or_
9 Which_one_below_describe_you_b
10 Do_you_own_or_rent_your_home
11 What_is_your_home_type
12 Please_identify_which_category
13 Including_yourself_how_many_p
14 Including_yourself_how_many_w
15 Including_yourself_how_many_c
16 Including_yourself_how_many_p_001
17 How_many_vehicles_are_owned_l
18 If_you_were_unable_to_use_your
19 Do_you_have_a_condition_or_han
20 How_long_you_had_this_conditio
21 Do_you_have_more_than_one_job
22 Do_you_work_full_time_or_part_
23 Which_best_describes_your_prim
24 At_your_primary_job_do_you_ha
25 Do_you_have_the_option_of_work
26 How_many_days_do_you_usually_w_001
27 What_days_of_the_week_do_you_t
28 How_did_you_usually_get_to_you
29 What_is_your_typical_access_eg
30 data.ts
31 data.fmt_time
32 data.local_dt.year
33 data.local_dt.month
34 data.local_dt.day
35 data.local_dt.hour
36 data.local_dt.minute
37 data.local_dt.second
38 data.local_dt.weekday
39 data.local_dt.timezone
40 How_old_are_you
41 What_is_your_gender
42 do_you_consider_yourself_to_be
43 What_is_your_race_ethnicity
44 Do_you_have_a_driver_license
45 Are_you_a_student
46 What_is_the_highest_grade_or_d
47 Are_you_a_paid_worker
48 Which_one_below_describe_you_b
49 Do_you_own_or_rent_your_home
50 What_is_your_home_type
51 Please_identify_which_category
52 Including_yourself_how_many_p
53 Including_yourself_how_many_w
54 Including_yourself_how_many_w_001
55 Including_yourself_how_many_p_001
56 How_many_motor_vehicles_are_ow
57 If_you_were_unable_to_use_your
58 Do_you_have_a_condition_or_han
59 How_long_you_had_this_conditio
60 Do_you_have_more_than_one_job
61 Do_you_work_full_time_or_part_
62 Which_best_describes_your_prim
63 Please_describe_your_primary_job
64 At_your_primary_job_do_you_ha
65 Do_you_have_the_option_of_work
66 How_many_days_do_you_usually_w_001
67 What_days_of_the_week_do_you_t
68 How_old_are_you
69 What_is_your_gender
70 do_you_consider_yourself_to_be
71 What_is_your_race_ethnicity
72 Do_you_have_a_driver_license
73 Are_you_a_student
74 What_is_the_highest_grade_or_d
75 Are_you_a_paid_worker
76 Which_one_below_describe_you_b
77 Do_you_own_or_rent_your_home
78 What_is_your_home_type
79 Please_identify_which_category
80 Including_yourself_how_many_p
81 Including_yourself_how_many_w
82 Including_yourself_how_many_w_001
83 Including_yourself_how_many_p_001
84 How_many_motor_vehicles_are_ow
85 If_you_were_unable_to_use_your
86 Do_you_have_a_condition_or_han
87 How_long_you_had_this_conditio
88 Do_you_have_more_than_one_job
89 Do_you_work_full_time_or_part_
90 Which_best_describes_your_prim
91 Please_describe_your_primary_job
92 At_your_primary_job_do_you_ha
93 Do_you_have_the_option_of_work
94 How_many_days_do_you_usually_w_001
95 What_days_of_the_week_do_you_t
96 At_your_primary_job_do_you_ha
97 Which_best_describes_your_prim
98 Do_you_work_full_time_or_part_
99 Do_you_have_the_option_of_work
100 Please_describe_your_primary_job
101 Do_you_have_more_than_one_job
102 What_days_of_the_week_do_you_t
103 How_many_days_do_you_usually_w_001
104 Which_one_below_describe_you_b
105 What_is_your_race_ethnicity
106 Are_you_a_student
107 What_is_the_highest_grade_or_d
108 do_you_consider_yourself_to_be
109 What_is_your_gender
110 How_old_are_you
111 Are_you_a_paid_worker
112 Do_you_have_a_driver_license
113 How_long_you_had_this_conditio
114 Including_yourself_how_many_w_001
115 Including_yourself_how_many_p
116 Do_you_own_or_rent_your_home
117 Please_identify_which_category
118 How_many_motor_vehicles_are_ow_001
119 Including_yourself_how_many_p_001
120 If_you_were_unable_to_use_your
121 Including_yourself_how_many_w
122 What_is_your_home_type
123 How_many_motor_vehicles_are_ow
124 Do_you_have_a_condition_or_han

@achasmita
Contributor

achasmita commented Oct 26, 2023

And most of these columns are repeated 4 or 3 times:

1 If_you_were_unable_to_use_your - 4
2 How_many_motor_vehicles_are_ow - 4
3 What_days_of_the_week_do_you_t - 4
4 What_is_the_highest_grade_or_d - 4
5 How_old_are_you - 4
6 Including_yourself_how_many_w - 4
7 Do_you_work_full_time_or_part_ - 4
8 What_is_your_gender - 4
9 Do_you_have_the_option_of_work - 4
10 do_you_consider_yourself_to_be - 3
11 How_long_you_had_this_conditio - 4
12 Are_you_a_student - 4
13 Do_you_have_more_than_one_job - 4
14 Do_you_have_a_condition_or_han - 4
15 Including_yourself_how_many_p_001 - 4
16 Do_you_own_or_rent_your_home - 4
17 What_is_your_home_type - 4
18 How_many_days_do_you_usually_w_001 - 4
19 Please_describe_your_primary_job - 3
20 Do_you_have_a_driver_license - 4
21 Which_one_below_describe_you_b - 4
22 Please_identify_which_category - 4
23 At_your_primary_job_do_you_ha - 4
24 Which_best_describes_your_prim - 4
25 Including_yourself_how_many_p - 4
26 Are_you_a_paid_worker - 3
27 Including_yourself_how_many_w_001 - 3
28 What_is_your_race_ethnicity - 3

After splitting the column names to display only the question part, they lose their unique identity:

'data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity'

The repeated columns above become duplicate columns; I think that is why we are getting this warning:

/usr/src/app/app_sidebar_collapsible.py:167: UserWarning:

DataFrame columns are not unique, some columns will be omitted.

Also, when these duplicate columns are omitted, we lose the data they held.
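
The collision can be checked with a few lines of plain Python. This is a minimal sketch, assuming the simplification keeps only the last dotted component of each survey response column; the helper name `simplify` is illustrative, and the `data.ts` entry is included only for contrast.

```python
from collections import Counter

# The three full column names quoted in this comment, plus one
# non-survey column for contrast.
full_columns = [
    "data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity",
    "data.ts",
]


def simplify(col):
    """Keep only the last dotted component of survey response columns."""
    if col.startswith("data.jsonDocResponse."):
        return col.rsplit(".", 1)[-1]
    return col


short_columns = [simplify(c) for c in full_columns]

# Any name that occurs more than once after simplification is a collision.
duplicates = {name: n for name, n in Counter(short_columns).items() if n > 1}
print(duplicates)
```

Running the same count over the real column list would be a quick way to regenerate the repeat counts listed above.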

@achasmita
Contributor

'data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity'

For now I am trying to change the 3rd key of the dictionary (aSfdnWs9LE6q8YEF7u9n85, data, aQhPrHNVZA6L2cxBaMeE9Y) into the same key (e.g. survey_id) so that they all look alike and we get a single column instead of 3 different columns for the same survey question.

@shankari
Contributor Author

I am not sure if that is a great idea. What is the problem you are trying to solve, and why is this the correct approach for solving it?

Again, our goal is not to have a bunch of hacks to "get things to work", our goal is to have a principled implementation

@achasmita
Contributor

achasmita commented Oct 30, 2023

In my dev environment I can see many columns that appear blank even though they are not empty. I also saw this warning:

/usr/src/app/app_sidebar_collapsible.py:167: UserWarning:

DataFrame columns are not unique, some columns will be omitted.

while running the code.

When I look into the code we are doing:

df.columns=[col.rsplit('.',1)[-1] if col.startswith('data.jsonDocResponse.') else col for col in df.columns] 

to simplify the column name.

But doing this, we get 3 different columns with the same name:

'data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity'
'data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity'

All of the above columns end up named What_is_your_race_ethnicity, and because they are duplicates, 2 of them are omitted randomly, so we lose the data for two of the three columns.

Therefore, if we make the full names of those columns identical, we can extract all of that data into the same column, and there will be no conflict between duplicate columns.
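
To see why duplicate names lose data, note that any conversion of a row into a mapping keyed by column name can hold only one value per name. The following is a minimal sketch in plain Python, offered as an analogy for what happens when the table is serialized. The row values and the helper `has_lost_data` are made up for illustration, and which duplicate actually survives in the dashboard is not confirmed here.

```python
# Column names after simplification: three distinct survey columns now
# share one name.
columns = [
    "user_id",
    "What_is_your_race_ethnicity",
    "What_is_your_race_ethnicity",
    "What_is_your_race_ethnicity",
]

# Hypothetical row: this user answered the second survey version only.
row = ["user-1", "", "example_answer", ""]

# A dict holds one value per key, so in this analogy later duplicates
# overwrite earlier ones and the real answer is replaced by an empty string.
record = dict(zip(columns, row))
print(record)


def has_lost_data(columns, row, record):
    """True if a non-empty cell in the row is missing from the record."""
    return any(
        value and record[name] != value for name, value in zip(columns, row)
    )


print(has_lost_data(columns, row, record))
```

A row like this one would then show up as a blank cell in the table even though the underlying entry has an answer.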

@shankari
Contributor Author

shankari commented Oct 30, 2023

That is the problem. Why is this (aka "For now I am trying to change 3rd key of dictionary aSfdnWs9LE6q8YEF7u9n85, data, aQhPrHNVZA6L2cxBaMeE9Y into same key (e.g. survey_id) so that they all look similar and we get single column") the solution?

Again, our goal is not to have a bunch of hacks to "get things to work", our goal is to have a principled implementation

@achasmita
Contributor

achasmita commented Oct 30, 2023

Yes, alternatively we can show all of aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity in the column name, but then we have 3 different columns for the same survey question, which increases the number of columns in the table, and there will also be a lot of blank cells in the table.

Combining them helps to solve this problem and the option I see is to make them identical. Also, I am still exploring other options.

@achasmita
Contributor

achasmita commented Oct 31, 2023

Problem: Duplicate column names in df, leading to data loss and blank cells in the table.
Solution1: Edit dictionary keys so that we can combine all column for same survey question.

Pros:

  • Ensures all data is retained without loss
  • simplifies df structure for easier analysis
  • reduces the number of columns in table [111 rows x 46 columns]

Solution 2: Include unique identity in column name while simplifying column names in df

Pros:

  • Ensures all data is retained without loss

Cons:

  • Increases the number of columns in the table [111 rows x 124 columns]
  • Multiple columns with the same survey question, making analysis harder
  • Lots of blank cells

Result(Solution1):
Screen Shot 2023-10-31 at 9 13 00 AM
Result(Solution2):
Screen Shot 2023-10-31 at 9 10 27 AM

Screen Shot 2023-10-31 at 10 06 10 AM
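
Solution 2 could be sketched as follows. This is a minimal example, assuming survey columns follow the layout `data.jsonDocResponse.<survey_id>.<group>.<question>` seen in the names quoted earlier; the function name `simplify_keep_survey` and the `<survey_id>.<question>` output format are illustrative choices, not the dashboard's actual implementation.

```python
PREFIX = "data.jsonDocResponse."


def simplify_keep_survey(col):
    """Shorten a survey column to '<survey_id>.<question>'.

    Assumes the layout PREFIX + survey_id + '.' + group + '.' + question.
    Columns that do not match that layout are returned unchanged.
    """
    if not col.startswith(PREFIX):
        return col
    parts = col[len(PREFIX):].split(".")
    if len(parts) < 2:
        return col
    survey_id, question = parts[0], parts[-1]
    return f"{survey_id}.{question}"


full_columns = [
    "data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity",
    "user_id",
]
short_columns = [simplify_keep_survey(c) for c in full_columns]
for name in short_columns:
    print(name)
```

The names stay unique, which avoids the data loss, at the cost of the wider table listed under Cons.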

@achasmita achasmita moved this to Issues being worked on in OpenPATH Tasks Overview Oct 31, 2023
@shankari
Contributor Author

shankari commented Oct 31, 2023

Problem: Duplicate column names in df, leading to data loss and blank cells in the table.

Is there actually data loss? A: Yes because if there are three similar columns, we only display data from one.

Solution1: Edit dictionary keys so that we can combine all column for same survey question.

But it is not clear that the survey question is the "same", given that it actually comes from multiple surveys.
For context, we currently allow users to specify their own surveys in the dynamic config.
So we can start with survey 3lkjwilaejrkjakl and then they can change it later to survey ajfsi34234qerf.
There is no guarantee that the questions in the two surveys are the same or mean the same thing.

For example, survey ajfsi34234qerf may omit questions from survey 3lkjwilaejrkjakl completely or may change the context around the prompt or ....

So I know that the words What_is_your_race_ethnicity are the same, but in other contexts, the question could depend on the title, e.g. rate_this_experience, where the "experience" may be different. It is also not clear to me whether the value is truncated: if one is what_is_your_race_ethnicity and the other is what_is_your_race_ethnicity_and_class_or_caste_group, and the second is truncated, it will seem like the same question but it will not be.

so IMHO, instead of "hacking" this by replacing the survey id and forcing the results to be in the same format, we should support having multiple versions of the survey over time.

I would encourage you to look up examples of custom surveys at
https://github.com/e-mission/nrel-openpath-deploy-configs/tree/main/survey_resources
and compare against
https://github.com/e-mission/e-mission-phone/tree/master/survey-resources/data-xls

@achasmita
Contributor

Another option:
Option 3: Create a separate table for each survey (by implementing subtabs).

Pros:

  • Ensures all data is retained without loss
  • Simplifies the df structure for easier analysis
  • Reduces the chance of duplicate column names
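
One way Option 3 could start is by partitioning the columns per survey before building any table. This is a minimal sketch, assuming the survey ID is the component immediately after `data.jsonDocResponse.`; the function name `group_columns_by_survey` and the return shape are hypothetical, and the subtab rendering itself is not shown.

```python
from collections import defaultdict

PREFIX = "data.jsonDocResponse."


def group_columns_by_survey(columns):
    """Split columns into per-survey groups.

    Returns (per_survey, shared), where per_survey maps
    survey_id -> {full column name: short question name} and shared lists
    the non-survey columns (e.g. 'user_id') that every table can include.
    """
    per_survey = defaultdict(dict)
    shared = []
    for col in columns:
        if col.startswith(PREFIX):
            parts = col[len(PREFIX):].split(".")
            per_survey[parts[0]][col] = parts[-1]
        else:
            shared.append(col)
    return dict(per_survey), shared


full_columns = [
    "user_id",
    "data.jsonDocResponse.aSfdnWs9LE6q8YEF7u9n85.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.data.group_hg4zz25.What_is_your_race_ethnicity",
    "data.jsonDocResponse.aQhPrHNVZA6L2cxBaMeE9Y.group_hg4zz25.What_is_your_race_ethnicity",
]
per_survey, shared = group_columns_by_survey(full_columns)
for survey_id, mapping in per_survey.items():
    print(survey_id, sorted(mapping.values()))
```

Because short names only need to be unique within one survey, each per-survey table can use the simplified question names without collisions across survey versions.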

@shankari
Contributor Author

shankari commented Nov 1, 2023

Bingo! That was what I had in mind as well. Program admins can then download the tables for each of their survey versions independently. And it doesn't matter if the questions are the same or different.

@achasmita
Contributor

Ok I will start working on that.
