Filter Large Volumes of Data in Python based on Criteria and Some Calculations

If you have a large volume of data and want to filter it quickly based on your own criteria, follow the steps in this tutorial. It also shows how to calculate the average of the columns you are interested in.

Requirements

-Anaconda Python Development Environment
-Data in CSV format
-A filtering criterion

Setup

To get access to Spyder, it is recommended to install Anaconda first. After the download has completed, run Anaconda and launch Spyder. If an Install button is shown instead of the Launch button, install Spyder first and then launch it. After Spyder opens, create a new file and save it in the same directory as your data. Now we are ready to start coding.

Step 1 Data

Let's assume that we had an eye tracker and collected a large volume of data from users' eyes. Before any data analysis, we need to process or filter the data so that the analysis is correct and reliable. Have a look at the data attached below. It is a data sample collected from an eye tracker on the pupil positions of the user.

Download Pupil Positions Data Example CSV

Step 2 Filter Data Based on Criteria

Now we can filter the eye tracking data based on the criterion. If you look at the pupil positions data from the eye tracker in step 1, there is a column named confidence that takes values from 0 to 1. Confidence is the pupil detector's assessment of how sure we can be about a measurement. A value of `0` indicates no confidence, meaning that we don't know anything about that measurement, and `1` indicates perfect confidence. Useful data carries a confidence value greater than ~0.6. Therefore we filter the data on the confidence column: we keep only the rows with confidence greater than or equal to 0.6, and from those rows only the columns that we are interested in. Copy the code below and paste it into your Spyder script, which should be located in the same folder as your data.

Language used: Python

                                        
    import pandas as pd

    # Read the CSV file
    pupil_positions = pd.read_csv('pupil_positions.csv')

    # Keep only the rows with confidence >= 0.6
    pupil_positions_confidence = pupil_positions[pupil_positions['confidence'] >= 0.6]

    # Keep only the columns of interest and write them to a new CSV file
    filtered_pupil_positions = pupil_positions_confidence[['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']]
    filtered_pupil_positions.to_csv('filtered_pupil_positions.csv', index=False)

After running the above script, the pupil positions data is filtered: only the columns that we want to analyse remain, for rows with confidence greater than or equal to 0.6.
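To see what the confidence filter does row by row, here is a minimal sketch on a small in-memory table. The values are made up for illustration; only the column names follow the tutorial's CSV.

```python
import pandas as pd

# Made-up sample rows; only the column names follow the tutorial's CSV.
sample = pd.DataFrame({
    "confidence": [0.0, 0.45, 0.6, 0.98],
    "norm_pos_x": [0.10, 0.52, 0.48, 0.50],
    "norm_pos_y": [0.90, 0.47, 0.51, 0.49],
})

# The comparison yields one True/False value per row (a boolean mask).
mask = sample["confidence"] >= 0.6

# Indexing with the mask keeps only the rows where it is True.
kept = sample[mask]

print(mask.tolist())  # [False, False, True, True]
print(len(kept))      # 2
```

Note that the row with confidence exactly 0.6 is kept, because the comparison uses >= rather than >.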

Download Filtered Pupil Positions in CSV

We can now see that only the columns we selected remain, with confidence values greater than or equal to 0.6. But we can also see that there are some missing values in the column diameter_3d. We can remove the rows with those missing values by calling the dropna() function. Have a look at the Python code below.

Language used: Python

                                        
    import pandas as pd

    # Read the CSV file
    pupil_positions = pd.read_csv('pupil_positions.csv')

    # Keep only the rows with confidence >= 0.6
    pupil_positions_confidence = pupil_positions[pupil_positions['confidence'] >= 0.6]

    # Keep only the columns of interest, drop rows with missing values, and write the result to a new CSV file
    filtered_pupil_positions = pupil_positions_confidence[['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']].dropna()
    filtered_pupil_positions.to_csv('filtered_pupil_positions.csv', index=False)

Now we have filtered the data based on confidence and removed the rows with missing values.

Download Filtered Pupil Positions Without Missing Values in CSV
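One detail worth knowing: called without arguments, dropna() removes a row if any of its columns is missing, not only diameter_3d. The sketch below, on made-up values, contrasts that with the subset parameter, which limits the check to the listed columns. Which behaviour you want depends on your analysis; the tutorial's script uses the stricter form.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN marks a missing measurement.
sample = pd.DataFrame({
    "confidence": [0.90, 0.80, 0.70],
    "norm_pos_x": [0.50, np.nan, 0.40],
    "diameter_3d": [3.1, 3.0, np.nan],
})

# dropna() with no arguments drops a row if ANY column is missing.
strict = sample.dropna()

# subset= restricts the check to the listed columns only.
only_diameter = sample.dropna(subset=["diameter_3d"])

print(len(strict))         # 1
print(len(only_diameter))  # 2
```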

Step 3 Calculate Average

To calculate the average of each column of our filtered pupil positions data, we can use the script below.

Language used: Python

                                        
    import pandas as pd
    import numpy as np

    # Read the CSV file
    pupil_positions = pd.read_csv('pupil_positions.csv')

    # Keep only the rows with confidence >= 0.6
    pupil_positions_confidence = pupil_positions[pupil_positions['confidence'] >= 0.6]

    # Keep only the columns of interest, drop rows with missing values, and write the result to a new CSV file
    filtered_pupil_positions = pupil_positions_confidence[['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']].dropna()
    filtered_pupil_positions.to_csv('filtered_pupil_positions.csv', index=False)

    # Calculate the average of each column and collect the results in a one-row table
    avg_data = pd.DataFrame(
        [[np.mean(filtered_pupil_positions['norm_pos_x']),
          np.mean(filtered_pupil_positions['norm_pos_y']),
          np.mean(filtered_pupil_positions['diameter_3d'])]],
        columns=['pupil_norm_pos_x', 'pupil_norm_pos_y', 'pupil_diameter_3d'])

    # Write the averages to a new CSV file (the row index is written as the first column)
    avg_data.to_csv('avg_filtered_pupil_positions.csv')

Download Average Filtered Pupil Positions Data in CSV
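As an alternative sketch, pandas can compute all column averages in one call with DataFrame.mean(), which avoids repeating np.mean for each column. The values below are made up, and the "pupil_" prefix simply mirrors the column names used in the tutorial's output file.

```python
import pandas as pd

# Made-up filtered data standing in for filtered_pupil_positions.
filtered = pd.DataFrame({
    "norm_pos_x": [0.25, 0.75],
    "norm_pos_y": [0.50, 0.50],
    "diameter_3d": [3.0, 4.0],
})

# mean() on a DataFrame returns a Series with one average per column.
column_means = filtered.mean()

# Turn the Series into a one-row DataFrame and prefix the column names.
avg_data = column_means.to_frame().T.add_prefix("pupil_")

print(list(avg_data.columns))
# ['pupil_norm_pos_x', 'pupil_norm_pos_y', 'pupil_diameter_3d']
```

The resulting avg_data can be written with to_csv() in the same way as in the script above.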

Conclusions

In this tutorial we used the Spyder application in Anaconda and Python code to filter data from an eye tracker using a criterion. We also dropped any missing values and calculated the average of each column of our filtered data. Using the scripts above, you can now filter large amounts of data in any data file that you have and calculate the averages using Python.
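If a file is too large to load comfortably into memory at once, one possible approach is to let pandas read it in chunks and filter each chunk separately. The sketch below is not part of the original scripts: the function name filter_in_chunks and its default chunk size are hypothetical, and a tiny in-memory CSV with made-up values stands in for a real file on disk.

```python
import io
import pandas as pd

def filter_in_chunks(source, min_confidence=0.6, chunksize=100_000):
    """Filter a CSV piece by piece and return the kept rows as one DataFrame."""
    columns = ["confidence", "norm_pos_x", "norm_pos_y", "diameter_3d"]
    kept = []
    # usecols loads only the needed columns; chunksize yields DataFrames of
    # at most that many rows instead of reading the whole file at once.
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        chunk = chunk[chunk["confidence"] >= min_confidence].dropna()
        kept.append(chunk)
    return pd.concat(kept, ignore_index=True)

# Tiny made-up CSV standing in for a large file; pass a file path instead
# (for example 'pupil_positions.csv') when working with real data.
demo_csv = io.StringIO(
    "confidence,norm_pos_x,norm_pos_y,diameter_3d\n"
    "0.95,0.50,0.50,3.2\n"
    "0.20,0.10,0.90,2.9\n"
    "0.80,0.48,0.52,\n"
    "0.60,0.51,0.49,3.1\n"
)
result = filter_in_chunks(demo_csv, chunksize=2)
print(len(result))  # 2
```

Of the four demo rows, one is dropped for low confidence and one for a missing diameter_3d value, so two rows remain.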