# Filter Large Volumes of Data in Python Based on Criteria and Some Calculations

If you have a large volume of data and want to filter it quickly based on your own criteria, follow the steps of this tutorial. It also shows how to calculate the average of the columns you are interested in.

# Requirements

- Anaconda Python development environment
- Data in CSV format
- A filtering criterion

# Setup

To get access to Spyder, it is recommended to download and install Anaconda first. Once that is done, run Anaconda and launch Spyder. If an Install button shows instead of the Launch button, install Spyder first and then run it. After you open Spyder, create a new file and save it in the same directory where your data is located. Now we are ready to start coding.

# Step 1 Data

Let's assume that we had an eye tracker and collected a large volume of data from users' eyes. Before any data analysis, we need to process or filter the data so that the analysis is correct and reliable. Have a look at the data I attached below. This is a data sample collected from an eye tracker on the pupil positions of the user.
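Before filtering, it can help to load the data and take a quick look at its shape and columns. The snippet below is a minimal sketch: the inline sample is a hypothetical stand-in for the real export, it only borrows the column names used later in this tutorial, and its values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the eye tracker export; the values are invented.
sample_csv = io.StringIO(
    "confidence,norm_pos_x,norm_pos_y,diameter_3d\n"
    "0.95,0.51,0.48,3.2\n"
    "0.10,0.40,0.62,\n"
    "0.72,0.49,0.50,3.1\n"
)

# With a real file, pass its path instead of sample_csv.
pupil_positions = pd.read_csv(sample_csv)

print(pupil_positions.shape)                     # (rows, columns)
print(pupil_positions.dtypes)                    # type of each column
print(pupil_positions['confidence'].describe())  # spread of the confidence values
```

Checking the range of `confidence` first gives a rough idea of how many rows a threshold will remove.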

# Step 2 Filter Data Based on Criteria

Now we can filter the eye tracking data based on the criteria. If you look at the pupil positions data from the eye tracker in Step 1, there is a column named `confidence` that takes values from 0 to 1. Confidence is the pupil detector's assessment of how sure we can be about a measurement. A value of `0` indicates no confidence, meaning we don't know anything, and `1` indicates perfect confidence. Useful data carries a confidence value greater than ~0.6. Therefore we need to filter the data on the `confidence` column and keep only the columns we are interested in, for rows with a confidence of 0.6 or higher. Copy the code below and paste it into your Spyder script, which should be saved in the same folder as your data.
```python
import pandas as pd

# Load the eye tracker export (file name assumed; change it to match your CSV)
pupil_positions = pd.read_csv('pupil_positions.csv')

# Keep only rows with confidence >= 0.6
pupil_positions_confidence = pupil_positions[pupil_positions['confidence'] >= 0.6]

# Keep only the columns of interest
filtered_pupil_positions = pupil_positions_confidence[['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']]

# Save the filtered data to a new CSV file
filtered_pupil_positions.to_csv('filtered_pupil_positions.csv', index=False)
```
After running the script above, we have filtered the pupil positions data and kept only the columns we want to analyse, for rows with a confidence of 0.6 or higher.
However, we can also see that there are some missing values in the `diameter_3d` column. We can remove the rows that contain them by adding the `dropna()` function. Have a look at the Python code below.
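```python
import pandas as pd

# Load the eye tracker export (file name assumed; change it to match your CSV)
pupil_positions = pd.read_csv('pupil_positions.csv')

# Keep only rows with confidence >= 0.6
pupil_positions_confidence = pupil_positions[pupil_positions['confidence'] >= 0.6]

# Keep only the columns of interest and drop rows with missing values
filtered_pupil_positions = pupil_positions_confidence[['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']].dropna()

# Save the filtered data to a new CSV file
filtered_pupil_positions.to_csv('filtered_pupil_positions.csv', index=False)
```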
Now we have filtered the data based on the confidence and removed the rows with missing values.
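To see how much data `dropna()` removes, you can count the missing values per column before dropping them. The snippet below is a minimal sketch on a small invented DataFrame (the values are made up, and only the column names match this tutorial). It also uses the `subset` argument of `dropna()`, which limits the check to `diameter_3d`. That is a variation on the script above, not what the script does.

```python
import numpy as np
import pandas as pd

# Invented sample rows that mimic the tutorial's columns.
df = pd.DataFrame({
    'confidence': [0.95, 0.72, 0.88, 0.30],
    'norm_pos_x': [0.51, 0.49, 0.50, 0.10],
    'norm_pos_y': [0.48, 0.50, 0.47, 0.90],
    'diameter_3d': [3.2, np.nan, 3.1, np.nan],
})

# Same confidence filter as in the tutorial.
high_conf = df[df['confidence'] >= 0.6]

# Count missing values in every column of the filtered data.
missing_per_column = high_conf.isna().sum()
print(missing_per_column)

# Drop only the rows where diameter_3d is missing.
cleaned = high_conf.dropna(subset=['diameter_3d'])
rows_dropped = len(high_conf) - len(cleaned)
print(f"Dropped {rows_dropped} of {len(high_conf)} rows")
```

If a large share of the rows is dropped, it may be worth checking why the values are missing before analysing the rest.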

# Step 3 Calculate Average

If you want to calculate the average of each column of the filtered pupil positions data, you can do that with the script below.
```python
import pandas as pd
import numpy as np

# Load the eye tracker export (file name assumed; change it to match your CSV)
pupil_positions = pd.read_csv('pupil_positions.csv')

# Keep rows with confidence >= 0.6, select the columns of interest, drop missing values
pupil_positions_confidence = pupil_positions[pupil_positions['confidence'] >= 0.6]
filtered_pupil_positions = pupil_positions_confidence[['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']].dropna()
filtered_pupil_positions.to_csv('filtered_pupil_positions.csv', index=False)

# Calculate the average of each column and write the result to a new CSV file
avg_data = pd.DataFrame(
    [[
        np.mean(filtered_pupil_positions['norm_pos_x']),
        np.mean(filtered_pupil_positions['norm_pos_y']),
        np.mean(filtered_pupil_positions['diameter_3d']),
    ]],
    columns=['pupil_norm_pos_x', 'pupil_norm_pos_y', 'pupil_diameter_3d'],
)
avg_data.to_csv('avg_filtered_pupil_positions.csv', index=False)
```
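If the CSV file is too large to load comfortably in one go, one option is to read it in chunks and filter each chunk as it arrives. This is a different technique from the scripts above, shown here only as a minimal sketch. The inline sample and the chunk size of 2 are assumptions chosen to keep the example small; with a real file you would pass its path and a much larger `chunksize`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large export; the values are invented.
sample_csv = io.StringIO(
    "confidence,norm_pos_x,norm_pos_y,diameter_3d\n"
    "0.95,0.50,0.40,3.0\n"
    "0.10,0.90,0.90,\n"
    "0.80,0.60,0.50,4.0\n"
    "0.70,0.40,0.60,\n"
    "0.65,0.70,0.30,5.0\n"
)

columns = ['confidence', 'norm_pos_x', 'norm_pos_y', 'diameter_3d']
kept_chunks = []

# usecols loads only the needed columns; chunksize yields the file piece by piece.
for chunk in pd.read_csv(sample_csv, usecols=columns, chunksize=2):
    # Apply the same filter as in the tutorial to each chunk.
    kept = chunk[chunk['confidence'] >= 0.6].dropna()
    kept_chunks.append(kept)

# Combine the filtered pieces and compute the column averages.
filtered = pd.concat(kept_chunks, ignore_index=True)
averages = filtered[['norm_pos_x', 'norm_pos_y', 'diameter_3d']].mean()
print(averages)
```

Only the rows that pass the filter are kept in memory, so the peak memory use depends on the chunk size and on how much data survives the filter rather than on the size of the whole file.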