Filter Large Volumes of Data in Python based on Criteria and Some Calculations
If you have a large volume of data and want to filter it quickly based on your own criteria, follow the steps of this tutorial.
This tutorial also shows how to calculate the average of the columns we are interested in.
To get Spyder, it is recommended to install Anaconda first.
After the download has completed, you can run Anaconda and launch Spyder. If an Install button shows instead of the Launch button, install Spyder first and then run it.
After you open Spyder, you can create a new file and save the file in the same directory where your data is located. Now we are ready to start coding.
Step 1 Data
Let's assume that we have an eye tracker and that we collected a large volume of data from users' eyes. Before any data analysis we need to process or filter the data.
This way we make sure that our data analysis will be correct and reliable. Have a look at the data I attached below. This is a sample of pupil position data collected from an eye tracker.
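Before filtering, it helps to load the file and look at its first rows. The following is a minimal sketch, assuming the data is a CSV file read with pandas. The sample values and the `timestamp` column are made up for illustration so the snippet runs on its own; only `confidence` and `diameter_3d` are column names taken from this tutorial, and `load_pupil_data` is a hypothetical helper name.

```python
import io

import pandas as pd

# Made-up stand-in for the eye tracker export (illustration only).
# `timestamp` is an assumed column; `confidence` and `diameter_3d`
# are the columns discussed in this tutorial.
SAMPLE_CSV = """timestamp,confidence,diameter_3d
0.00,0.95,3.1
0.01,0.40,2.9
0.02,0.60,
0.03,0.80,3.3
"""


def load_pupil_data(source):
    """Read pupil position data from a CSV path or a file-like object."""
    return pd.read_csv(source)


# With real data, pass the file name of your own export instead of the sample.
data = load_pupil_data(io.StringIO(SAMPLE_CSV))

print(data.shape)   # number of rows and columns
print(data.head())  # first rows, to check the column names
```

Because the script is saved in the same folder as the data, a plain file name (without a full path) should be enough when you switch to your own file.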
Now we can filter the eye tracking data based on our criteria. If you look at the pupil position data from the eye tracker in step 1, there is a column named confidence that takes values from 0 to 1.
Confidence is an assessment by the pupil detector of how sure we can be about a measurement. A value of `0` indicates no confidence. `1` indicates perfect confidence. Useful data carries a confidence value greater than ~0.6. A `confidence` of exactly `0` means that we don't know anything.
Therefore we need to filter the data on the confidence column, keeping only the columns we are interested in and only the rows with a confidence greater than or equal to 0.6.
Copy the code below and paste it into your Spyder script, which should be located in the same folder as your data.
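A minimal sketch of this filtering step, assuming the data is loaded with pandas. The sample rows are made up so the snippet runs on its own, `filter_by_confidence` is a hypothetical helper name, and the selected columns are just an example; swap in your own file and the columns you need.

```python
import io

import pandas as pd

# Made-up sample rows (illustration only); replace with your own file.
SAMPLE_CSV = """timestamp,confidence,diameter_3d
0.00,0.95,3.1
0.01,0.40,2.9
0.02,0.60,
0.03,0.80,3.3
"""


def filter_by_confidence(df, columns, threshold=0.6):
    """Keep rows with confidence >= threshold and only the given columns."""
    mask = df["confidence"] >= threshold  # boolean mask, one entry per row
    return df.loc[mask, columns]


data = pd.read_csv(io.StringIO(SAMPLE_CSV))
filtered = filter_by_confidence(data, ["confidence", "diameter_3d"])
print(filtered)
```

The boolean mask is evaluated on the whole column at once rather than row by row in a Python loop, which is what keeps this approach fast on large files.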
We can now see that only the columns we selected remain, with confidence values greater than or equal to 0.6. But we can also see that there are some missing values in the column diameter_3d.
We can remove those missing values with the dropna() function. Have a look at the Python code below.
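A minimal sketch of dropping the missing values and then averaging each column, again assuming pandas. The sample rows are made up so the snippet runs on its own, and the selected columns are just an example.

```python
import io

import pandas as pd

# Made-up sample rows (illustration only); replace with your own file.
SAMPLE_CSV = """timestamp,confidence,diameter_3d
0.00,0.95,3.1
0.01,0.40,2.9
0.02,0.60,
0.03,0.80,3.3
"""

data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Keep rows with confidence >= 0.6 and only the columns of interest.
filtered = data.loc[data["confidence"] >= 0.6, ["confidence", "diameter_3d"]]

# dropna() removes every row that has a missing value in any remaining column.
cleaned = filtered.dropna()

# mean() returns the average of each column as a pandas Series.
averages = cleaned.mean()
print(averages)
```

Note that `dropna()` returns a new DataFrame rather than changing `filtered` in place, so the result has to be assigned to a variable before the averages are calculated.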
In this tutorial we used the Spyder application in Anaconda and Python code to filter data from an eye tracker using a criterion. We also dropped any missing values and calculated the average of each column of our filtered data.
Using the scripts above, you can now filter large amounts of data in any data file that you have and calculate the averages using Python.