The goal of this project is to create machine learning models using Tor network packet flow data, to determine whether an instance is communicating with a monitored website or an unmonitored website, and to identify its destination if it is a monitored website.
In the closed-world experiments, the user can only access monitored (previously known) websites.
The goal is to classify the 95 monitored websites.
We used an SVM, a decision tree, and a random forest model.
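As a rough illustration of the closed-world setup, the sketch below trains the three model families on synthetic data. It assumes scikit-learn is available (as it is in Colab). The generated features, the 5-class stand-in for the 95 websites, and all hyperparameters are placeholders, not the project's actual features or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for extracted traffic features (assumption):
# 5 classes here instead of the 95 monitored websites, to keep the demo fast.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, n_classes=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Default or illustrative hyperparameters, not tuned values from the project.
models = {
    "svm": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: accuracy={scores[name]:.3f}")
```

With the real dataset, `X` and `y` would be replaced by the feature matrix and the website labels built from the monitored traces.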
In the open-world experiments, the user can access any website within the system.
Data can be classified into two parts:
- monitored data: websites the attacker is interested in
- unmonitored data: websites deemed irrelevant by the attacker
Monitored website instances are treated as positive samples, and unmonitored website instances are treated as negative samples.
Determine whether the web traffic trace corresponds to a monitored website. To do this, we reassign the label '1' to all monitored website instances (positive samples) and assign the label '-1' to all unmonitored website instances (negative samples).
Classify 95 monitored website traces with unique labels against additional unmonitored websites. In the multi-class setting, we label the monitored website instances with {0, 1, 2, ..., 94} and the unmonitored website instances with the label '-1'.
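The two open-world labeling schemes can be sketched in plain Python as follows. The helper names and the toy site IDs are hypothetical. Only the label values ('1' and '-1' for the binary task, 0 to 94 and '-1' for the multi-class task) come from the description above.

```python
UNMONITORED_LABEL = -1  # label used for every unmonitored instance


def multiclass_labels(monitored_site_ids, n_unmonitored):
    """Keep each monitored site's ID (0..94) and append -1 per unmonitored instance."""
    return list(monitored_site_ids) + [UNMONITORED_LABEL] * n_unmonitored


def binary_labels(multiclass):
    """Collapse multi-class labels: any monitored site becomes 1, unmonitored stays -1."""
    return [UNMONITORED_LABEL if y == UNMONITORED_LABEL else 1 for y in multiclass]


# Toy example: three monitored traces (sites 0, 0 and 94) and two unmonitored traces.
y_multi = multiclass_labels([0, 0, 94], n_unmonitored=2)
y_binary = binary_labels(y_multi)

print(y_multi)
print(y_binary)
```

Deriving the binary labels from the multi-class ones keeps the two label vectors aligned with the same instance order.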
We used a decision tree and a random forest model.
You can download the monitored and unmonitored data from the Google Drive folder below.
[dataset](https://drive.google.com/drive/folders/13sDplxKUNmntbYr6WhpqQARiBvH41Oum)
You can run the code in Colab. Please upload the downloaded data files to Colab first.
‼️ You need to replace the paths in this code with the absolute paths of the files mon_standard10.pkl and unmon_standard10_3000.pkl on your drive ‼️
with open("/content/sample_data/mon_standard.pkl", "rb") as file:
with open("/content/sample_data/unmon_standard10_3000.pkl", "rb") as file:
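The two `with open(...)` lines above only open the files. A minimal sketch of how such a line might be completed with the standard `pickle` module is shown below. The `load_pickle` helper and the toy dictionary are hypothetical, and the internal layout of the real dataset files is not described here, so inspect the loaded object before using it.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from `path` (hypothetical helper)."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Toy stand-in for a dataset file; the structure of the real files is an unknown here.
toy = {"site_0": [[1, -1, 1]], "site_1": [[-1, -1]]}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.pkl"
    with open(toy_path, "wb") as file:
        pickle.dump(toy, file)
    loaded = load_pickle(toy_path)

# Inspect the type and size before assuming any particular layout.
print(type(loaded).__name__, len(loaded))
```

In Colab, the same helper would be called with the absolute path of each uploaded file. Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.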