diff --git a/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb b/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb index 34c14aab..5c7af3b3 100644 --- a/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb +++ b/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb @@ -1,1487 +1,1197 @@ { - "cells": [ - { - "cell_type": "markdown", - "source": [ - "## Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse\r\n", - "\r\n", - "__Notebook Version:__ 1.0
\r\n", - "__Python Version:__ Python 3.8 - AzureML
\r\n", - "__Required Packages:__ azureml-synapse, Msticpy, azure-storage-file-datalake
\r\n", - "__Platforms Supported:__ Azure Machine Learning Notebooks connected to Azure Synapse Workspace\r\n", - " \r\n", - "__Data Source Required:__ Yes\r\n", - "\r\n", - "__Data Source:__ CommonSecurityLogs\r\n", - "\r\n", - "__Spark Version:__ 3.1 or above\r\n", - " \r\n", - "### Description\r\n", - "In this sample guided scenario notebook, we will demonstrate how to set up continuous data pipeline to store data into azure data lake storage (ADLS) and \r\n", - "then hunt on that data at scale using distributed processing via Azure Synapse workspace connected to serverless Spark pool. \r\n", - "Once historical dataset is available in ADLS , we can start performing common hunt operations, create a baseline of normal behavior using PySpark API and also apply data transformations \r\n", - "to find anomalous behaviors such as periodic network beaconing as explained in the blog - [Detect Network beaconing via Intra-Request time delta patterns in Microsoft Sentinel - Microsoft Tech Community](https://techcommunity.microsoft.com/t5/azure-sentinel/detect-network-beaconing-via-intra-request-time-delta-patterns/ba-p/779586). \r\n", - "You can use various other spark API to perform other data transformation to understand the data better. \r\n", - "The output generated can also be further enriched to populate Geolocation information and also visualize using Msticpy capabilities to identify any anomalies. \r\n", - ".
\r\n", - "*** Python modules download may be needed. ***
\r\n", - "*** Please run the cells sequentially to avoid errors. Please do not use \"run all cells\". ***
\r\n", - "\r\n", - "## Table of Contents\r\n", - "1. Warm-up\r\n", - "2. Authentication to Azure Resources\r\n", - "3. Configure Azure ML and Azure Synapse Analytics\r\n", - "4. Load the Historical and current data\r\n", - "5. Data Wrangling using Spark\r\n", - "6. Enrich the results\r\n", - "7. Conclusion\r\n", - "\r\n" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Warm-up" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "> **Note**: Install below packages only for the first time and restart the kernel once done." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "# Install AzureML Synapse package to use spark magics\r\n", - "import sys\r\n", - "!{sys.executable} -m pip install azureml-synapse" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse\n", + "\n", + "__Notebook Version:__ 1.1
\n", + "__Python Version:__ Python 3.8 - AzureML
\n", + "__Required Packages:__ Msticpy
\n", + "__Platforms Supported:__ Azure Machine Learning Notebooks connected to Azure Synapse Workspace\n", + " \n", + "__Data Source Required:__ Yes\n", + "\n", + "__Data Source:__ CommonSecurityLogs\n", + "\n", + "__Spark Version:__ 3.1 or above\n", + " \n", + "### Description\n", + "In this sample guided scenario notebook, we will demonstrate how to set up continuous data pipeline to store data into azure data lake storage (ADLS) and \n", + "then hunt on that data at scale using distributed processing via Azure Synapse workspace connected to serverless Spark pool. \n", + "Once historical dataset is available in ADLS , we can start performing common hunt operations, create a baseline of normal behavior using PySpark API and also apply data transformations \n", + "to find anomalous behaviors such as periodic network beaconing as explained in the blog - [Detect Network beaconing via Intra-Request time delta patterns in Microsoft Sentinel - Microsoft Tech Community](https://techcommunity.microsoft.com/t5/azure-sentinel/detect-network-beaconing-via-intra-request-time-delta-patterns/ba-p/779586). \n", + "You can use various other spark API to perform other data transformation to understand the data better. \n", + "The output generated can also be further enriched to populate Geolocation information and also visualize using Msticpy capabilities to identify any anomalies. \n", + ".
\n", + "
\n", + "*** Please run the cells sequentially to avoid errors. Please do not use \"run all cells\". ***
\n", + "\n", + "## Table of Contents\n", + "1. Pre-requisites\n", + "2. Data Preparation\n", + "3. Data Wrangling using Spark\n", + "4. Data Enrichment\n", + "5. Data Visualization\n", + "6. Data Ingestion to Sentinel as Custom Table\n", + "7. Conclusion\n", + "\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Pre-requisites" + ] }, - "gather": { - "logged": 1632406406186 - } - } - }, - { - "cell_type": "code", - "source": [ - "# Install Azure storage datalake library to manipulate file systems\r\n", - "import sys\r\n", - "!{sys.executable} -m pip install azure-storage-file-datalake --pre" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Set-up Synapse Spark Pool in Azure Machine Learning\n", + "\n", + "This notebook requires Azure ML Spark Compute. If you are using it for the first time, follow the guidelines mentioned here \n", + "[Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-synapse-spark-pool?tabs=studio-ui)\n", + "\n", + "Once you have completed the pre-requisites, you will see AzureML Spark Compute in the dropdown menu for Compute. Select it and run any cell to start Spark Session. \n", + "
Please refer to the docs [Managed (Automatic) Spark compute in Azure Machine Learning Notebooks](https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml) for more detailed steps along with screenshots.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "nteract": {
+ "transient": {
+ "deleting": false
+ }
+ }
+ },
+ "source": [
+ "### Installing Libraries\n",
+ "This notebook also requires several libraries to be installed before the Spark session starts. \n",
+ "
In order to install any libraries, you can use a conda file to configure a Spark session. Please save below file as conda.yml , check the Upload conda file checkbox. Then, select Browse, and choose the conda file saved earlier with the Spark session configuration you want.\n", + "\n", + "```yaml\n", + "name: msticpy\n", + "channels:\n", + "- defaults\n", + "dependencies:\n", + "- bokeh\n", + "- numpy\n", + "- pip:\n", + " - keyring>=13.2.1\n", + " - msticpy[azure]>=2.3.1\n", + "```\n", + "\n", + "### Set-up msticpyconfig.yaml to use with Azure ML with Managed Spark Pool\n", + "\n", + "Once we've done above steps, we need to make sure we have configuration to tell MSTICPy how to connect to your sentinel workspace or any other supported data providers.\n", + "\n", + "This configuration is stored in a configuration file (`msticpyconfig.yaml`).\n", + "\n", + "Learn more...\n", + "Although you don't need to know these details now, you can find more information here:\n", + "\n", + "- [MSTICPy Package Configuration](https://msticpy.readthedocs.io/en/latest/getting_started/msticpyconfig.html)\n", + "- [MSTICPy Settings Editor](https://msticpy.readthedocs.io/en/latest/getting_started/SettingsEditor.html)\n", + "\n", + "If you need a more complete walk-through of configuration, we have a separate notebook to help you:\n", + "- [Configuring Notebook Environment](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb)\n", + "- And for the ultimate walk-through of how to configure all your `msticpyconfig.yaml` settings see the [MPSettingsEditor notebook](https://github.com/microsoft/msticpy/blob/master/docs/notebooks/MPSettingsEditor.ipynb)\n", + "- The Azure-Sentinel-Notebooks GitHub repo also contains an template `msticpyconfig.yaml`, with commented-out sections that may also be helpful in finding your way around the settings if you want to dig into things by hand.\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "*** $\\color{red}{Note:~After~installing~the~packages,~please~restart~the~kernel.}$ ***" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "# Load Python libraries that will be used in this notebook\r\n", - "from azure.common.client_factory import get_client_from_cli_profile\r\n", - "from azure.common.credentials import get_azure_cli_credentials\r\n", - "from azure.mgmt.resource import ResourceManagementClient\r\n", - "from azureml.core import Workspace, LinkedService, SynapseWorkspaceLinkedServiceConfiguration, Datastore\r\n", - "from azureml.core.compute import SynapseCompute, ComputeTarget\r\n", - "from datetime import timedelta, datetime\r\n", - "from azure.storage.filedatalake import DataLakeServiceClient\r\n", - "from azure.core._match_conditions import MatchConditions\r\n", - "from azure.storage.filedatalake._models import ContentSettings\r\n", - "\r\n", - "import json\r\n", - "import os, uuid, sys\r\n", - "import IPython\r\n", - "import pandas as pd\r\n", - "from ipywidgets import widgets, Layout\r\n", - "from IPython.display import display, HTML\r\n", - "from pathlib import Path\r\n", - "\r\n", - "REQ_PYTHON_VER=(3, 6)\r\n", - "REQ_MSTICPY_VER=(1, 4, 4)\r\n", - "\r\n", - "display(HTML(\"

Starting Notebook setup...

\"))\r\n", - "if Path(\"./utils/nb_check.py\").is_file():\r\n", - " from utils.nb_check import check_python_ver, check_mp_ver\r\n", - "\r\n", - " check_python_ver(min_py_ver=REQ_PYTHON_VER)\r\n", - " try:\r\n", - " check_mp_ver(min_msticpy_ver=REQ_MSTICPY_VER)\r\n", - " except ImportError:\r\n", - " !pip install --upgrade msticpy\r\n", - " if \"msticpy\" in sys.modules:\r\n", - " importlib.reload(sys.modules[\"msticpy\"])\r\n", - " else:\r\n", - " import msticpy\r\n", - " check_mp_ver(REQ_MSTICPY_VER)\r\n", - "\r\n", - "# If not using Azure Notebooks, install msticpy with\r\n", - "# !pip install msticpy\r\n", - "\r\n", - "from msticpy.nbtools import nbinit\r\n", - "extra_imports = [\r\n", - " \"msticpy.nbtools.nbdisplay, draw_alert_entity_graph\",\r\n", - " \"msticpy.sectools.ip_utils, convert_to_ip_entities\",\r\n", - " \"msticpy.nbtools.ti_browser, browse_results\",\r\n", - " \"IPython.display, Image\",\r\n", - " \"msticpy.sectools.ip_utils, get_whois_info\",\r\n", - " \"msticpy.sectools.ip_utils, get_ip_type\"\r\n", - "]\r\n", - "nbinit.init_notebook(\r\n", - " namespace=globals(),\r\n", - " # additional_packages=[\"azureml-synapse\", \"azure-cli\", \"azure-storage-file-datalake\"],\r\n", - " extra_imports=extra_imports,\r\n", - ");\r\n", - "\r\n", - "\r\n", - "WIDGET_DEFAULTS = {\r\n", - " \"layout\": Layout(width=\"95%\"),\r\n", - " \"style\": {\"description_width\": \"initial\"},\r\n", - "}\r\n", - "\r\n", - "#Set pandas options\r\n", - "pd.get_option('max_rows',10)\r\n", - "pd.set_option('max_colwidth',50)" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Set environment variable MSTICPYCONFIG\n", + "Once you have your `msticpyconfig.yaml` ready in your Users directory, set the environment variable with the path to the file. \n", + "\n", + "Default file share is mounted and available to Managed (Automatic) Spark compute. \n", + "
Refer to the docs: [Accessing data on the default file share](https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml#accessing-data-on-the-default-file-share)\n",
+ "\n",
+ "In the code below, we first generate the absolute path of the current mount directory and then append the path where `msticpyconfig.yaml` is stored. \n",
+ "
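For reference, a minimal sketch of the same idea without the widget input used in the next cells, assuming the notebook's working directory is the default file-share mount and using a placeholder user path:

```python
import os

# Placeholder -- substitute your own AML user folder and file name.
user_config = "/Users/<alias>/msticpyconfig.yaml"

# The notebook runs from the default file-share mount, so the absolute
# path of "." plus the user-relative path yields the full file path.
msticpyconfig_path = os.path.abspath(".") + user_config
os.environ["MSTICPYCONFIG"] = msticpyconfig_path

# MSTICPy picks the variable up when it is imported (or after mp.refresh_config()).
```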
Execute the below cell to get the current path and then replace it in the next cell, also replace Username or path to `msticpyconfig.yaml`" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "import os\n", + "from ipywidgets import widgets\n", + "# Provide AML Directory path to msticpyconfig.yaml e.g. /Users/ashwin/msticpyconfig.yaml\n", + "yaml_path = widgets.Text(description='AML Directory Path to msticpyconfig.aml:')\n", + "display(yaml_path)" + ] }, - "gather": { - "logged": 1632982861159 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Configure Azure ML and Azure Synapse Analytics\r\n", - "\r\n", - "Please use notebook [Configurate Azure ML and Azure Synapse Analytics](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/Configurate%20Azure%20ML%20and%20Azure%20Synapse%20Analytics.ipynb) to configure environment. \r\n", - "\r\n", - "The notebook will configure existing Azure synapse workspace to create and connect to Spark pool. You can then create linked service and connect AML workspace to Azure Synapse workspaces.\r\n", - "
It will also configure data export rules to export data from Log analytics workspace CommonSecurityLog table to Azure Data lake storage Gen 2.\r\n", - "\r\n", - "> **Note**: Specify the input parameters in below step in order to connect AML workspace to synapse workspace using linked service." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "amlworkspace = \"\" # fill in your AML workspace name\r\n", - "subscription_id = \"\" # fill in your subscription id\r\n", - "resource_group = '' # fill in your resource groups for AML workspace\r\n", - "linkedservice = '' # fill in your linked service created to connect to synapse workspace" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1678913457350 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Retrieve AML notebook mount path\n", + "abs_path = os.path.abspath('.')\n", + "msticpyconfig_path = f\"{abs_path}{yaml_path.value}\"\n", + "print(msticpyconfig_path)" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Authentication to Azure Resources\r\n", - "\r\n", - "In this step we will connect aml workspace to linked service connected to Azure Synapse workspace" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "# Get the aml workspace\r\n", - "aml_workspace = Workspace.get(name=amlworkspace, subscription_id=subscription_id, resource_group=resource_group)\r\n", - "\r\n", - "# Retrieve a known linked service\r\n", - "linked_service = LinkedService.get(aml_workspace, linkedservice)" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "%env MSTICPYCONFIG = {msticpyconfig}" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Load Python libraries that will be used in this notebook\n", + "import json\n", + "import os, uuid, sys\n", + "import IPython\n", + "import pandas as pd\n", + "from ipywidgets import widgets, Layout\n", + "from IPython.display import display, HTML\n", + "from pathlib import Path\n", + "from datetime import timedelta, datetime\n", + "\n", + "import msticpy as mp\n", + "mp.refresh_config()\n", + "\n", + "from msticpy.context.geoip import GeoLiteLookup\n", + "from msticpy.context.tilookup import TIl\n", + "from msticpy.nbtools import entityschema\n", + "from msticpy.sectools.ip_utils import convert_to_ip_entities\n", + "from msticpy.nbtools.foliummap import FoliumMap, get_map_center\n", + "from msticpy.data.uploaders.loganalytics_uploader import LAUploader\n", + "mp.check_version()\n", + "# mp.init_notebook(globals())\n", + "\n", 
+ "REQ_PYTHON_VER=(3, 6)\n", + "REQ_MSTICPY_VER=(1, 4, 4)\n", + "\n", + "display(HTML(\"

Starting Notebook setup...

\"))\n", + "\n", + "# If not using Azure Notebooks, install msticpy with\n", + "# %pip install msticpy\n", + "\n", + "WIDGET_DEFAULTS = {\n", + " \"layout\": Layout(width=\"95%\"),\n", + " \"style\": {\"description_width\": \"initial\"},\n", + "}\n", + "\n", + "#Set pandas options\n", + "pd.set_option('display.max_rows', 500)\n", + "pd.set_option('display.max_colwidth', 500)" + ] }, - "gather": { - "logged": 1632982882257 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Start Spark Session\r\n", - "Enter your Synapse Spark compute below. To find the Spark compute, please follow these steps:
\r\n", - "1. On the AML Studio left menu, navigate to **Linked Services**
\r\n", - "2. Click on the name of the Link Service you want to use
\r\n", - "3. Select **Spark pools** tab
\r\n", - "4. Get the Name of the Spark pool you want to use.
" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "synapse_spark_compute = input('Synapse Spark compute:')" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Preparation\n", + "In this step, we will define several details associated with ADLS account and specify input date and lookback period to calculate baseline. Based on the input dates and lookback period , we will load the data.\n", + "\n", + "\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "# Start spark session\r\n", - "%synapse start -s $subscription_id -w $amlworkspace -r $resource_group -c $synapse_spark_compute" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Primary storage info\n", + "account_name = '' # fill in your primary account name\n", + "container_name = '' # fill in your container name\n", + "subscription_id = '' # fill in your subscription id\n", + "resource_group = '' # fill in your resource groups for ADLS\n", + "workspace_name = '' # fill in your workspace name\n", + "device_vendor = \"Fortinet\" # Replace your desired network vendor from commonsecuritylogs\n", + "\n", + "# Datetime and lookback parameters\n", + "end_date = \"\" # fill in your input date\n", + "lookback_days = 21 # fill in lookback days if you want to run it on historical data. 
make sure you have historical data available in ADLS" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632245154374 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "from pyspark.sql.types import *\n", + "from pyspark.sql.window import Window\n", + "from pyspark.sql.functions import lag, col\n", + "from pyspark.sql.functions import *\n", + "from pyspark.sql import functions as F\n", + "from datetime import timedelta, datetime, date\n", + "\n", + "# Compiling ADLS paths from date string\n", + "end_date_str = end_date.split(\"-\")\n", + "current_path = f\"/y={end_date_str[0]}/m={end_date_str[1]}/d={end_date_str[2]}\"\n", + "\n", + "def generate_adls_paths(end_date, lookback_days, adls_path):\n", + " endDate = datetime.strptime(end_date, '%Y-%m-%d')\n", + " endDate = endDate - timedelta(days=1)\n", + " startDate = endDate - timedelta(days=lookback_days)\n", + " daterange = [startDate + timedelta(days=x) for x in range((endDate-startDate).days + 1)]\n", + "\n", + " pathlist = []\n", + " for day in daterange:\n", + " date_str = day.strftime('%Y-%m-%d').split(\"-\")\n", + " day_path = adls_path + f\"/y={date_str[0]}/m={date_str[1]}/d={date_str[2]}\"\n", + " pathlist.append(day_path)\n", + "\n", + " return pathlist\n", + "\n", + "adls_path = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/WorkspaceResourceId=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/providers/microsoft.operationalinsights/workspaces/{workspace_name}'\n", + "current_day_path = adls_path + current_path\n", + "historical_path = generate_adls_paths(end_date, lookback_days, adls_path)\n" + ] }, - "gather": { - "logged": 1632959931295 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Data Preparation\r\n", - "In this step, we will define several details associated with ADLS account and specify input date and lookback period to calculate baseline. Based on the input dates and lookback period , we will load the data.\r\n", - "\r\n", - "\r\n" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "\r\n", - "# Primary storage info\r\n", - "account_name = '' # fill in your primary account name\r\n", - "container_name = '' # fill in your container name\r\n", - "subscription_id = '' # fill in your subscription id\r\n", - "resource_group = '' # fill in your resource groups for ADLS\r\n", - "workspace_name = '' # fill in your workspace name\r\n", - "device_vendor = \"Fortinet\" # Replace your desired network vendor from commonsecuritylogs\r\n", - "\r\n", - "# Datetime and lookback parameters\r\n", - "end_date = \"\" # fill in your input date\r\n", - "lookback_days = 21 # fill in lookback days if you want to run it on historical data. make sure you have historical data available in ADLS" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Load Current day\n", + "\n", + "In this step, you will load the data based on the input date specified." 
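To make the date partitioning concrete, this is roughly what the `generate_adls_paths` helper above resolves to for a hypothetical end date and lookback window (placeholder dates):

```python
# Hypothetical inputs: end_date = "2023-03-15", lookback_days = 3.
# The helper steps back from the day before end_date through the lookback window:
example_paths = generate_adls_paths("2023-03-15", 3, adls_path)
# example_paths ==
# [adls_path + "/y=2023/m=03/d=11",
#  adls_path + "/y=2023/m=03/d=12",
#  adls_path + "/y=2023/m=03/d=13",
#  adls_path + "/y=2023/m=03/d=14"]
```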
+ ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "from pyspark.sql.types import *\r\n", - "from pyspark.sql.window import Window\r\n", - "from pyspark.sql.functions import lag, col\r\n", - "from pyspark.sql.functions import *\r\n", - "from pyspark.sql import functions as F\r\n", - "from datetime import timedelta, datetime, date\r\n", - "\r\n", - "# Compiling ADLS paths from date string\r\n", - "end_date_str = end_date.split(\"-\")\r\n", - "current_path = f\"/y={end_date_str[0]}/m={end_date_str[1]}/d={end_date_str[2]}\"\r\n", - "\r\n", - "def generate_adls_paths(end_date, lookback_days, adls_path):\r\n", - " endDate = datetime.strptime(end_date, '%Y-%m-%d')\r\n", - " endDate = endDate - timedelta(days=1)\r\n", - " startDate = endDate - timedelta(days=lookback_days)\r\n", - " daterange = [startDate + timedelta(days=x) for x in range((endDate-startDate).days + 1)]\r\n", - "\r\n", - " pathlist = []\r\n", - " for day in daterange:\r\n", - " date_str = day.strftime('%Y-%m-%d').split(\"-\")\r\n", - " day_path = adls_path + f\"/y={date_str[0]}/m={date_str[1]}/d={date_str[2]}\"\r\n", - " pathlist.append(day_path)\r\n", - "\r\n", - " return pathlist\r\n", - "\r\n", - "adls_path = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/WorkspaceResourceId=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/providers/microsoft.operationalinsights/workspaces/{workspace_name}'\r\n", - "current_day_path = adls_path + current_path\r\n", - "historical_path = generate_adls_paths(end_date, lookback_days, adls_path)\r\n" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "try:\n", + " current_df = (\n", + " spark.read.option(\"recursiveFileLook\", \"true\")\n", + " .option(\"header\", \"true\")\n", + " .json(current_day_path)\n", + " )\n", + "\n", + " current_df = ( \n", + " current_df\n", + " .select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " )\n", + " .filter(F.col(\"DeviceVendor\") == device_vendor)\n", + " )\n", + "except Exception as e:\n", + " print(f\"Could note load the data due to error:{e}\")\n", + "\n", + "#Display the count of records\n", + "print(f\"No of records loaded from the current date specified: {current_df.count()}\")" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Load Historical data\n", + "\n", + "You can also perform the analysis on all historical data available in your ADLS account. The notebook is currently configured to run only on current date specified in input. \n", + "\n", + "If you need to perform the same analysis on historical data, run the cell below and under Data Wrangling using Spark -> Filtering Data code cell, replace `current_df` with `historical_df` variable. \n", + "
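As an aside, `spark.read.json` also accepts a list of paths, so the per-day union in the cell below can be sketched more compactly (assuming the same columns and `DeviceVendor` filter used elsewhere in this notebook):

```python
# Compact alternative sketch: read every lookback-day path in one call.
historical_df = (
    spark.read.option("recursiveFileLookup", "true")
    .json(historical_path)
    .select(
        "TimeGenerated", "SourceIP", "SourcePort", "DestinationIP",
        "DestinationPort", "Protocol", "ReceivedBytes", "SentBytes", "DeviceVendor",
    )
    .filter(F.col("DeviceVendor") == device_vendor)
)
print(f"Records loaded for the lookback window: {historical_df.count()}")
```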
**Otherwise SKIP running below cell as it will result in an error if you do not have historical data**" + ] }, - "gather": { - "logged": 1632245154374 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Load Current day\r\n", - "\r\n", - "In this step, you will load the data based on the input date specified." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "\r\n", - "try:\r\n", - " current_df = (\r\n", - " spark.read.option(\"recursiveFileLook\", \"true\")\r\n", - " .option(\"header\", \"true\")\r\n", - " .json(current_day_path)\r\n", - " )\r\n", - "\r\n", - " current_df = ( \r\n", - " current_df\r\n", - " .select(\r\n", - " \"TimeGenerated\",\r\n", - " \"SourceIP\",\r\n", - " \"SourcePort\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " \"Protocol\",\r\n", - " \"ReceivedBytes\",\r\n", - " \"SentBytes\",\r\n", - " \"DeviceVendor\",\r\n", - " )\r\n", - " .filter(F.col(\"DeviceVendor\") == device_vendor)\r\n", - " )\r\n", - "except Exception as e:\r\n", - " print(f\"Could note load the data due to error:{e}\")\r\n", - "\r\n", - "#Display the count of records\r\n", - "print(f\"No of records loaded from the current date specified: {current_df.count()}\")" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "try:\n", + " #Read Previous days data\n", + " historical_df = (\n", + " spark.read.option(\"recursiveFileLook\", \"true\")\n", + " .option(\"header\", \"true\")\n", + " .json(historical_path[-1])\n", + " )\n", + " historical_df = historical_df.select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " ).filter(F.col(\"DeviceVendor\") == device_vendor)\n", + "\n", + " #Read all historical days data per day and union them together\n", + " for path in historical_path[:-1]:\n", + " daily_table = (\n", + " spark.read.option(\"recursiveFileLook\", \"true\")\n", + " .option(\"header\", \"true\")\n", + " .json(path)\n", + " )\n", + " daily_table = daily_table.select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " ).filter(F.col(\"DeviceVendor\") == device_vendor)\n", + " historical_df = historical_df.union(daily_table)\n", + "\n", + "except Exception as e:\n", + " print(f\"Could not load the data due to error:\\n {e}\")\n", + " \n", + "#Display the count of records\n", + "print(f\"\\n No of records loaded from the lookback days specified: {historical_df.count()}\")" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Load Historical data\r\n", - "\r\n", - "You can also perform the analysis on all historical data available in your ADLS account. The notebook is currently configured to run only on current date specified in input. 
\r\n", - "\r\n", - "If you need to perform the same analysis on historical data, run the cell below and under Data Wrangling using Spark -> Filtering Data code cell, replace `current_df` with `historical_df` variable. \r\n", - "
**Otherwise SKIP running below cell as it will result in an error if you do not have historical data**" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "\r\n", - "try:\r\n", - " #Read Previous days data\r\n", - " historical_df = (\r\n", - " spark.read.option(\"recursiveFileLook\", \"true\")\r\n", - " .option(\"header\", \"true\")\r\n", - " .json(historical_path[-1])\r\n", - " )\r\n", - " historical_df = historical_df.select(\r\n", - " \"TimeGenerated\",\r\n", - " \"SourceIP\",\r\n", - " \"SourcePort\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " \"Protocol\",\r\n", - " \"ReceivedBytes\",\r\n", - " \"SentBytes\",\r\n", - " \"DeviceVendor\",\r\n", - " ).filter(F.col(\"DeviceVendor\") == device_vendor)\r\n", - "\r\n", - " #Read all historical days data per day and union them together\r\n", - " for path in historical_path[:-1]:\r\n", - " daily_table = (\r\n", - " spark.read.option(\"recursiveFileLook\", \"true\")\r\n", - " .option(\"header\", \"true\")\r\n", - " .json(path)\r\n", - " )\r\n", - " daily_table = daily_table.select(\r\n", - " \"TimeGenerated\",\r\n", - " \"SourceIP\",\r\n", - " \"SourcePort\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " \"Protocol\",\r\n", - " \"ReceivedBytes\",\r\n", - " \"SentBytes\",\r\n", - " \"DeviceVendor\",\r\n", - " ).filter(F.col(\"DeviceVendor\") == device_vendor)\r\n", - " historical_df = historical_df.union(daily_table)\r\n", - "\r\n", - "except Exception as e:\r\n", - " print(f\"Could not load the data due to error:\\n {e}\")\r\n", - " \r\n", - "#Display the count of records\r\n", - "print(f\"\\n No of records loaded from the lookback days specified: {historical_df.count()}\")" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Wrangling using Spark" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Data Wrangling using Spark" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Filtering data\r\n", - "In this step, we will prepare dataset by filtering logs to only destination as Public/external IPs. For this, we are using regex and rlike spark API to filter internal to external network traffic." 
- ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "\r\n", - "PrivateIPregex = (\"^127\\.|^10\\.|^172\\.1[6-9]\\.|^172\\.2[0-9]\\.|^172\\.3[0-1]\\.|^192\\.168\\.\")\r\n", - "cooked_df = (current_df # replace historical_df if you want to use historical data\r\n", - " # use below filter if you have Palo Alto logs\r\n", - " # .filter(\r\n", - " # (F.col(\"Activity\") == \"TRAFFIC\")\r\n", - " # )\r\n", - " .withColumn(\r\n", - " \"DestinationIsPrivate\", F.col(\"DestinationIP\").rlike(PrivateIPregex)\r\n", - " )\r\n", - " .filter(F.col(\"DestinationIsPrivate\") == \"false\")\r\n", - " .withColumn(\"TimeGenerated\", F.col(\"TimeGenerated\").cast(\"timestamp\"))\r\n", - " )\r\n", - "\r\n", - "cooked_df.show()" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Filtering data\n", + "In this step, we will prepare dataset by filtering logs to only destination as Public/external IPs. For this, we are using regex and rlike spark API to filter internal to external network traffic." + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Baseline data to filter known Source IP and Destination IPs\r\n", - "\r\n", - "In this step, you can either analyze Historical data or current data to filter source IP and destination IP per defined criteria. \r\n", - "\r\n", - "In below example, we are filtering the Source IP which has daily event count more than the specified threshold.\r\n", - "
Also, you can filter the destination IPs whom very less source IPs are connecting. This will reduce false positives be filtering destination IPs which are commonly seen from internal systems which are likely benign." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "daily_event_count_threshold = 100 # Replace the threshold based on your environment or use default values \r\n", - "degree_of_srcip_threshold = 25 # Replace the threshold based on your environment or use default values \r\n", - "\r\n", - "# Filtering IP list per TotalEventsThreshold\r\n", - "csl_srcip = (\r\n", - " cooked_df.groupBy(\"SourceIP\")\r\n", - " .count()\r\n", - " .filter(F.col(\"count\") > daily_event_count_threshold)\r\n", - " .orderBy(F.col(\"count\"), ascending=False)\r\n", - " )\r\n", - "\r\n", - "# Filtering Destination IP list per Degree of Source IPs threshold\r\n", - "csl_dstip = (\r\n", - " cooked_df.groupBy(\"DestinationIP\")\r\n", - " .agg(F.countDistinct(\"SourceIP\").alias(\"DegreeofSourceIps\"))\r\n", - " .filter(F.col(\"DegreeofSourceIps\") < degree_of_srcip_threshold)\r\n", - " )\r\n", - "\r\n", - "# Filtering IP list per Daily event threshold\r\n", - "baseline_df = (\r\n", - " cooked_df.join(csl_srcip, [\"SourceIP\"])\r\n", - " .join(csl_dstip, [\"DestinationIP\"])\r\n", - " .select(\r\n", - " \"TimeGenerated\",\r\n", - " \"SourceIP\",\r\n", - " \"SourcePort\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " \"Protocol\",\r\n", - " \"ReceivedBytes\",\r\n", - " \"SentBytes\",\r\n", - " \"DeviceVendor\",\r\n", - " )\r\n", - " )\r\n", - "\r\n", - "baseline_df.show()" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "PrivateIPregex = (\"^127\\.|^10\\.|^172\\.1[6-9]\\.|^172\\.2[0-9]\\.|^172\\.3[0-1]\\.|^192\\.168\\.\")\n", + "cooked_df = (current_df # replace historical_df if you want to use historical data\n", + " # use below filter if you have Palo Alto logs\n", + " # .filter(\n", + " # (F.col(\"Activity\") == \"TRAFFIC\")\n", + " # )\n", + " .withColumn(\n", + " \"DestinationIsPrivate\", F.col(\"DestinationIP\").rlike(PrivateIPregex)\n", + " )\n", + " .filter(F.col(\"DestinationIsPrivate\") == \"false\")\n", + " .withColumn(\"TimeGenerated\", F.col(\"TimeGenerated\").cast(\"timestamp\"))\n", + " )\n", + "\n", + "cooked_df.show()" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Rank the datasets and Calculate PercentageBeaconing\r\n", - "\r\n", - "In this step, we will use spark to wrangle the data by applying below transformations.\r\n", - "- Sort the dataset per SourceIP\r\n", - "- Calculate the time difference between First event and next event.\r\n", - "- Partition dataset per Source IP, Destination IP, Destination Port\r\n", - "- Window dataset into consecutive 3 to Calculate the Timedeltalistcount based on cluster of 1-3 timedelta events.\r\n", - "- Calculate percentagebeacon between TotalEventscount and Timedeltalistcount\r\n", - "- Apply thresholds to further reduce false positives\r\n", - "\r\n", - "** SPARK References:**\r\n", - "- 
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html\r\n", - "- https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#window" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "time_delta_threshold = 15 # Replace thresholds in seconds interval between 2 successive events. min 10 to anything maximum.\r\n", - "percent_beacon_threshold = 75 # Replace thresholds in percentage. Max value can be 100.\r\n", - "\r\n", - "# Serialize the dataset by sorting timegenerated and partition by SourceIP and WorkspaceId\r\n", - "w = (\r\n", - " Window()\r\n", - " .partitionBy(F.col(\"SourceIP\"))\r\n", - " .orderBy(F.col(\"TimeGenerated\"))\r\n", - " )\r\n", - "\r\n", - "# Calculate new timestamp column of next event\r\n", - "csl_beacon_df = baseline_df.select(\r\n", - " \"*\", lag(\"TimeGenerated\").over(w).alias(\"prev_TimeStamp\")\r\n", - " ).na.drop()\r\n", - "\r\n", - "# Calculate timedelta difference between previoud and next timestamp\r\n", - "timeDiff = F.unix_timestamp(\"TimeGenerated\") - F.unix_timestamp(\"prev_TimeStamp\")\r\n", - "\r\n", - "# Add new column as timedelta\r\n", - "csl_beacon_df = csl_beacon_df.withColumn(\"Timedelta\", timeDiff).filter(\r\n", - " F.col(\"Timedelta\") > time_delta_threshold\r\n", - " )\r\n", - "\r\n", - "csl_beacon_ranked = csl_beacon_df.groupBy(\r\n", - " \"DeviceVendor\",\r\n", - " \"SourceIP\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " \"Protocol\",\r\n", - " \"Timedelta\",\r\n", - " ).agg(\r\n", - " F.count(\"Timedelta\").alias(\"Timedeltacount\"),\r\n", - " F.sum(\"SentBytes\").alias(\"TotalSentBytes\"),\r\n", - " F.sum(\"ReceivedBytes\").alias(\"TotalReceivedBytes\"),\r\n", - " F.count(\"*\").alias(\"Totalevents\"),\r\n", - " )\r\n", - "\r\n", - "w1 = (\r\n", - " Window.partitionBy(\r\n", - " \"DeviceVendor\",\r\n", - " \"SourceIP\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " )\r\n", - " .orderBy(F.col(\"SourceIP\").cast(\"long\"))\r\n", - " .rowsBetween(-2, 0)\r\n", - " )\r\n", - "\r\n", - "csl_beacon_df = (\r\n", - " csl_beacon_ranked\r\n", - " .join(csl_dstip, [\"DestinationIP\"])\r\n", - " .withColumn(\"Timedeltalist\", F.collect_list(\"Timedeltacount\").over(w1))\r\n", - " .withColumn(\r\n", - " \"Timedeltalistcount\",\r\n", - " F.expr(\"AGGREGATE(Timedeltalist, DOUBLE(0), (acc, x) -> acc + x)\").cast(\r\n", - " \"long\"\r\n", - " ),\r\n", - " )\r\n", - " .withColumn(\r\n", - " \"Totaleventcount\",\r\n", - " F.sum(\"Timedeltalistcount\").over(\r\n", - " Window.partitionBy(\"SourceIP\", \"DestinationIP\", \"DestinationPort\")\r\n", - " ),\r\n", - " )\r\n", - " .withColumn(\r\n", - " \"Percentbeacon\",\r\n", - " (\r\n", - " F.col(\"Timedeltalistcount\")\r\n", - " / F.sum(\"Timedeltalistcount\").over(\r\n", - " Window.partitionBy(\r\n", - " \"DeviceVendor\",\r\n", - " \"SourceIP\",\r\n", - " \"DestinationIP\",\r\n", - " \"DestinationPort\",\r\n", - " )\r\n", - " )\r\n", - " * 100.0\r\n", - " ),\r\n", - " )\r\n", - " .filter(F.col(\"Percentbeacon\") > percent_beacon_threshold)\r\n", - " .filter(F.col(\"Totaleventcount\") > daily_event_count_threshold)\r\n", - " .orderBy(F.col(\"Percentbeacon\"), ascending=False)\r\n", - " )\r\n", - "\r\n", - "\r\n", - "csl_beacon_df.show()" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + 
"nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Baseline data to filter known Source IP and Destination IPs\n", + "\n", + "In this step, you can either analyze Historical data or current data to filter source IP and destination IP per defined criteria. \n", + "\n", + "In below example, we are filtering the Source IP which has daily event count more than the specified threshold.\n", + "
Also, you can filter the destination IPs whom very less source IPs are connecting. This will reduce false positives be filtering destination IPs which are commonly seen from internal systems which are likely benign." + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Export results from ADLS\r\n", - "In this step, we will save the results from previous step as single json file in ADLS. This file can be exported from ADLS to be used with native python session outside spark pool for more data analysis, visualization etc." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%%synapse\r\n", - "dir_name = \"\" # specify desired directory name\r\n", - "new_path = adls_path + dir_name\r\n", - "csl_beacon_pd = csl_beacon_df.coalesce(1).write.format(\"json\").save(new_path)\r\n" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "daily_event_count_threshold = 100 # Replace the threshold based on your environment or use default values \n", + "degree_of_srcip_threshold = 25 # Replace the threshold based on your environment or use default values \n", + "\n", + "# Filtering IP list per TotalEventsThreshold\n", + "csl_srcip = (\n", + " cooked_df.groupBy(\"SourceIP\")\n", + " .count()\n", + " .filter(F.col(\"count\") > daily_event_count_threshold)\n", + " .orderBy(F.col(\"count\"), ascending=False)\n", + " )\n", + "\n", + "# Filtering Destination IP list per Degree of Source IPs threshold\n", + "csl_dstip = (\n", + " cooked_df.groupBy(\"DestinationIP\")\n", + " .agg(F.countDistinct(\"SourceIP\").alias(\"DegreeofSourceIps\"))\n", + " .filter(F.col(\"DegreeofSourceIps\") < degree_of_srcip_threshold)\n", + " )\n", + "\n", + "# Filtering IP list per Daily event threshold\n", + "baseline_df = (\n", + " cooked_df.join(csl_srcip, [\"SourceIP\"])\n", + " .join(csl_dstip, [\"DestinationIP\"])\n", + " .select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " )\n", + " )\n", + "\n", + "baseline_df.show()" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Stop Spark Session" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "%synapse stop" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Rank the datasets and Calculate PercentageBeaconing\n", + "\n", + "In this step, we will use spark to wrangle the data by applying below transformations.\n", + "- Sort the dataset per SourceIP\n", + "- Calculate the time difference between First event and next event.\n", + "- Partition dataset per Source IP, Destination IP, Destination Port\n", + "- Window dataset into consecutive 3 to Calculate the 
Timedeltalistcount based on a cluster of 1-3 timedelta events.\n",
+ "- Calculate Percentbeacon as the percentage of Timedeltalistcount relative to the total event count\n",
+ "- Apply thresholds to further reduce false positives\n",
+ "\n",
+ "**Spark References:**\n",
+ "- https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html\n",
+ "- https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#window"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "jupyter": {
+ "outputs_hidden": false,
+ "source_hidden": false
+ },
+ "nteract": {
+ "transient": {
+ "deleting": false
+ }
+ }
+ },
+ "outputs": [],
+ "source": [
+ "time_delta_threshold = 15 # Threshold in seconds for the interval between 2 successive events. Minimum 10; no upper limit.\n",
+ "percent_beacon_threshold = 75 # Threshold in percentage. 
Max value can be 100.\n", + "\n", + "# Serialize the dataset by sorting timegenerated and partition by SourceIP and WorkspaceId\n", + "w = (\n", + " Window()\n", + " .partitionBy(F.col(\"SourceIP\"))\n", + " .orderBy(F.col(\"TimeGenerated\"))\n", + " )\n", + "\n", + "# Calculate new timestamp column of next event\n", + "csl_beacon_df = baseline_df.select(\n", + " \"*\", lag(\"TimeGenerated\").over(w).alias(\"prev_TimeStamp\")\n", + " ).na.drop()\n", + "\n", + "# Calculate timedelta difference between previoud and next timestamp\n", + "timeDiff = F.unix_timestamp(\"TimeGenerated\") - F.unix_timestamp(\"prev_TimeStamp\")\n", + "\n", + "# Add new column as timedelta\n", + "csl_beacon_df = csl_beacon_df.withColumn(\"Timedelta\", timeDiff).filter(\n", + " F.col(\"Timedelta\") > time_delta_threshold\n", + " )\n", + "\n", + "csl_beacon_ranked = csl_beacon_df.groupBy(\n", + " \"DeviceVendor\",\n", + " \"SourceIP\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"Timedelta\",\n", + " ).agg(\n", + " F.count(\"Timedelta\").alias(\"Timedeltacount\"),\n", + " F.sum(\"SentBytes\").alias(\"TotalSentBytes\"),\n", + " F.sum(\"ReceivedBytes\").alias(\"TotalReceivedBytes\"),\n", + " F.count(\"*\").alias(\"Totalevents\"),\n", + " )\n", + "\n", + "w1 = (\n", + " Window.partitionBy(\n", + " \"DeviceVendor\",\n", + " \"SourceIP\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " )\n", + " .orderBy(F.col(\"SourceIP\").cast(\"long\"))\n", + " .rowsBetween(-2, 0)\n", + " )\n", + "\n", + "csl_beacon_df = (\n", + " csl_beacon_ranked\n", + " .join(csl_dstip, [\"DestinationIP\"])\n", + " .withColumn(\"Timedeltalist\", F.collect_list(\"Timedeltacount\").over(w1))\n", + " .withColumn(\n", + " \"Timedeltalistcount\",\n", + " F.expr(\"AGGREGATE(Timedeltalist, DOUBLE(0), (acc, x) -> acc + x)\").cast(\n", + " \"long\"\n", + " ),\n", + " )\n", + " .withColumn(\n", + " \"Totaleventcount\",\n", + " F.sum(\"Timedeltalistcount\").over(\n", + " Window.partitionBy(\"SourceIP\", \"DestinationIP\", \"DestinationPort\")\n", + " ),\n", + " )\n", + " .withColumn(\n", + " \"Percentbeacon\",\n", + " (\n", + " F.col(\"Timedeltalistcount\")\n", + " / F.sum(\"Timedeltalistcount\").over(\n", + " Window.partitionBy(\n", + " \"DeviceVendor\",\n", + " \"SourceIP\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " )\n", + " )\n", + " * 100.0\n", + " ),\n", + " )\n", + " .filter(F.col(\"Percentbeacon\") > percent_beacon_threshold)\n", + " .filter(F.col(\"Totaleventcount\") > daily_event_count_threshold)\n", + " .orderBy(F.col(\"Percentbeacon\"), ascending=False)\n", + " )\n", + "\n", + "\n", + "csl_beacon_df.show()" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Display results" + ] }, - "gather": { - "logged": 1632987153746 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Download the files from ADLS\r\n", - "\r\n", - "In below sections, we will provide input details about ADLS account ad then use functions to connect , list contents and download results locally.\r\n", - "\r\n", - "If you need help in locating input details, follow below steps\r\n", - "- Go to the https://web.azuresynapse.net and sign in to your workspace.\r\n", - "- In Synapse Studio, click Data, select the Linked tab, and select the container under Azure Data Lake Storage Gen2.\r\n", - "- Navigate to folder from the container, right click and 
select Properies.\r\n", - "- Copy ABFSS path , extact the details and map to the input fields\r\n", - "\r\n", - "\r\n", - "You can check [View account access keys](https://docs.microsoft.com/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#view-account-access-keys) doc to find and retrieve your storage account keys for ADLS account.\r\n", - "\r\n", - "

\r\n", - "Warning: If you are storing secrets such as storage account keys in the notebook you should
\r\n", - "probably opt to store either into msticpyconfig file on the compute instance or use\r\n", - "Azure Key Vault to store the secrets.
\r\n", - "Read more about using KeyVault\r\n", - "in the MSTICPY docs\r\n", - "

" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "# Primary storage info\r\n", - "account_name = \"\" # fill in your primary account name\r\n", - "container_name = \"\" # fill in your container name\r\n", - "subscription_id = \"\"\r\n", - "resource_group = \"\" # fill in your resource gropup for ADLS account\r\n", - "workspace_name = \"\" # fill in your la workspace name\r\n", - "input_path = f\"WorkspaceResourceId=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/providers/microsoft.operationalinsights/workspaces/\"\r\n", - "\r\n", - "adls_path = f\"abfss://{container_name}@{account_name}.dfs.core.windows.net/{input_path}/{workspace_name}\"\r\n", - "\r\n", - "dir_name = \"/\" #Replace the dirname previously specified to store results from spark\r\n", - "account_key = \"\" # Replace your storage account key" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632987687222 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Convert spark dataframe to pandas dataframe\n", + "df = csl_beacon_df.toPandas()\n", + "# Display results\n", + "df.head()" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Enrichment\n", + "In this section, we will enrich entities retrieved from network beaconing behavior such as IP information.\n", + "Types of Enrichment which will beneficial in perfoming investigation will be IP Geolcation , Whois Registrar information and ThreatIntel lookups.\n", + "\n", + "For first time users, please refer `Getting Started Guide For Microsoft Sentinel ML Notebooks` and section [Create your configuration file](https://docs.microsoft.com/azure/sentinel/notebook-get-started#create-your-configuration-file) to create your `mstipyconfig`. " + ] }, - "gather": { - "logged": 1633376635135 - } - } - }, - { - "cell_type": "code", - "source": [ - "new_path = input_path + dir_name\r\n", - "\r\n", - "initialize_storage_account(account_name, account_key)\r\n", - "pathlist = list_directory_contents(container_name, new_path, \"json\")\r\n", - "input_file = pathlist[0].split(\"/\")[-1]\r\n", - "download_file_from_directory(container_name, new_path, input_file)\r\n", - "\r\n", - "json_normalize(\"output.json\", \"out_normalized.json\")" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### IP Geolocation Enrichment\n", + "In this step, we will use msticpy geolocation capabilities using maxmind database. You will need maxmind API key to download the database.\n", + "\n", + "
Note:\n", + "You may see the GeoLite driver downloading its database the first time you run this.\n", + "
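Once the `df_lookup_ip` cell further below has added geolocation columns to `df`, a quick per-country summary can help spot unexpected destinations. This is a minimal sketch, not part of the original notebook; it assumes the lookup has already run and that your msticpy version adds a `CountryName` column (older releases may use different column names).

```python
# Minimal sketch: summarise geolocated destinations by country.
# Assumes `df` was enriched by iplocation.df_lookup_ip(df, column="DestinationIP")
# and that the lookup added a "CountryName" column (verify for your msticpy version).
if "CountryName" in df.columns:
    geo_summary = (
        df.groupby("CountryName")["DestinationIP"]
        .nunique()
        .sort_values(ascending=False)
        .head(10)
    )
    print(geo_summary)
```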
\n", + "
\n", + "
\n", + " Learn more about MSTICPy GeoIP providers...\n", + "

\n", + " MSTICPy GeoIP Providers\n", + "

\n", + "
\n", + "
" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632990019315 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "iplocation = GeoLiteLookup()\n", + "\n", + "df = iplocation.df_lookup_ip(df, column=\"DestinationIP\")\n", + "df.head()" + ] }, - "gather": { - "logged": 1632987683046 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Display results" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "df = pd.read_json('out_normalized.json')\r\n", - "\r\n", - "df.head()" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Whois registration enrichment\n", + "In this step, we can perform whois lokup on all public destination ips and populate additional information such as ASN. You can use this output to further filter known ASNs from the results." + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632989270369 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "num_ips = len(df[\"DestinationIP\"].unique())\n", + "print(f\"Performing WhoIs lookups for {num_ips} IPs \", end=\"\")\n", + "df[\"DestASN\"] = df.apply(lambda x: get_whois_info(x.DestinationIP, True), axis=1)\n", + "df[\"DestASNFull\"] = df.apply(lambda x: x.DestASN[1], axis=1)\n", + "df[\"DestASN\"] = df.apply(lambda x: x.DestASN[0], axis=1)\n", + "\n", + "#Display results\n", + "df.head()" + ] }, - "gather": { - "logged": 1632987687222 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Enrich results \r\n", - "In this section, we will enrich entities retrieved from network beaconing behavior such as IP information.\r\n", - "Types of Enrichment which will beneficial in perfoming investigation will be IP Geolcation , Whois Registrar information and ThreatIntel lookups.\r\n", - "\r\n", - "For first time users, please refer `Getting Started Guide For Microsoft Sentinel ML Notebooks` and section [Create your configuration file](https://docs.microsoft.com/azure/sentinel/notebook-get-started#create-your-configuration-file) to create your `mstipyconfig`. " - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### IP Geolocation Enrichment\r\n", - "In this step, we will use msticpy geolocation capabilities using maxmind database. You will need maxmind API key to download the database.\r\n", - "\r\n", - "
Note:\r\n", - "You may see the GeoLite driver downloading its database the first time you run this.\r\n", - "
\r\n", - "
\r\n", - "
\r\n", - " Learn more about MSTICPy GeoIP providers...\r\n", - "

\r\n", - " MSTICPy GeoIP Providers\r\n", - "

\r\n", - "
\r\n", - "
" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "from msticpy.sectools.geoip import GeoLiteLookup\r\n", - "\r\n", - "iplocation = GeoLiteLookup()\r\n", - "\r\n", - "df = iplocation.df_lookup_ip(df, column=\"DestinationIP\")\r\n", - "df.head()" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### ThreatIntel Enrichment\n", + "\n", + "In this step, we can perform threatintel lookup using msticpy and open source TI providers such as IBM Xforce, VirusTotal, Greynoise etc. \n", + "Below example shows performing lookup on single IP as well as bulk lookup on all ips using IBM Xforce TI Provider. \n", + "
You will need to register with IBM Xforce and enter API keys into `msticpyconfig.yaml`\n",
        "\n",
        "
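Before running the lookup cells that follow, it can help to confirm which TI providers msticpy actually loaded from your `msticpyconfig.yaml`. A minimal sketch, assuming a recent msticpy release where `TILookup` exposes a `loaded_providers` mapping (adjust if your version differs):

```python
# Minimal sketch: confirm which TI providers were loaded from msticpyconfig.yaml.
# `loaded_providers` is assumed to be available on TILookup in recent msticpy releases.
from msticpy.context.tilookup import TILookup

ti_lookup = TILookup()
print("Loaded TI providers:", list(ti_lookup.loaded_providers.keys()))
```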
\n", + " Learn more...\n", + "

\n", + "

\n", + "
    \n", + "
  • More details are shown in the A Tour of Cybersec notebook features notebook
  • \n", + "
  • Threat Intel Lookups in MSTICPy
  • \n", + "
  • To learn more about adding TI sources, see the TI Provider setup in the A Getting Started Guide For Microsoft Sentinel ML Notebooks notebook\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632990440057 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "ti_lookup = TILookup()\n", + "# Perform lookup on single IOC\n", + "result = ti_lookup.lookup_ioc(observable=\"52.183.120.194\", providers=[\"XForce\"])\n", + "ti_lookup.result_to_df(result)" + ] }, - "gather": { - "logged": 1632990019315 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Whois registration enrichment\r\n", - "In this step, we can perform whois lokup on all public destination ips and populate additional information such as ASN. You can use this output to further filter known ASNs from the results." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "num_ips = len(df[\"DestinationIP\"].unique())\r\n", - "print(f\"Performing WhoIs lookups for {num_ips} IPs \", end=\"\")\r\n", - "df[\"DestASN\"] = df.apply(lambda x: get_whois_info(x.DestinationIP, True), axis=1)\r\n", - "df[\"DestASNFull\"] = df.apply(lambda x: x.DestASN[1], axis=1)\r\n", - "df[\"DestASN\"] = df.apply(lambda x: x.DestASN[0], axis=1)\r\n", - "\r\n", - "#Display results\r\n", - "df.head()" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632991107856 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Flattening all the desnation IPs into comma separated list\n", + "ip_list = df['DestinationIP'].astype(str).values.flatten().tolist()\n", + "\n", + "# Perform bulk lookup on all IPs with specified providers\n", + "ti_resp = ti_lookup.lookup_iocs(data=ip_list, providers=[\"AzSTI\", \"XForce\"])\n", + "select_ti = browse_results(ti_resp, severities=['high','warning'])\n", + "select_ti" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Visualization\n", + "MSTICpy also includes a feature to allow you to map locations, this can be particularily useful when looking at the distribution of remote network connections or other events. Below we plot the locations of destination IPs observed in our results." + ] }, - "gather": { - "logged": 1632989270369 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### ThreatIntel Enrichment\r\n", - "\r\n", - "In this step, we can perform threatintel lookup using msticpy and open source TI providers such as IBM Xforce, VirusTotal, Greynoise etc. \r\n", - "Below example shows performing lookup on single IP as well as bulk lookup on all ips using IBM Xforce TI Provider. \r\n", - "
You will need to register with IBM Xforce and enter API keys into `mstipyconfig.yaml`\r\n", - "\r\n", - "
\r\n", - " Learn more...\r\n", - "

\r\n", - "

\r\n", - "
    \r\n", - "
  • More details are shown in the A Tour of Cybersec notebook features notebook
  • \r\n", - "
  • Threat Intel Lookups in MSTICPy
  • \r\n", - "
  • To learn more about adding TI sources, see the TI Provider setup in the A Getting Started Guide For Microsoft Sentinel ML Notebooks notebook\r\n", - "
\r\n", - "
\r\n", - "
\r\n", - "\r\n", - "\r\n" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "code", - "source": [ - "ti_lookup = TILookup()\r\n", - "# Perform lookup on single IOC\r\n", - "result = ti_lookup.lookup_ioc(observable=\"52.183.120.194\", providers=[\"XForce\"])\r\n", - "ti_lookup.result_to_df(result)" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Convert our IP addresses in string format into an ip address entity\n", + "ip_entity = entityschema.IpAddress()\n", + "ip_list = [convert_to_ip_entities(i)[0] for i in df['DestinationIP'].head(10)]\n", + " \n", + "# Get center location of all IP locaitons to center the map on\n", + "location = get_map_center(ip_list)\n", + "logon_map = FoliumMap(location=location, zoom_start=4)\n", + "\n", + "# Add location markers to our map and dsiplay it\n", + "if len(ip_list) > 0:\n", + " logon_map.add_ip_cluster(ip_entities=ip_list)\n", + "display(logon_map.folium_map)" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Ingestion to Sentinel as Custom Table\n", + "Finally, you can also upload the enriched events generated from this notebook as a custom table to Microsoft Sentinel. This will allow to correlate with more data sources available in sentinel and continue investigate or create incidents when malicious activity is confirmed." 
+ ] }, - "gather": { - "logged": 1632990440057 - } - } - }, - { - "cell_type": "code", - "source": [ - "# Flattening all the desnation IPs into comma separated list\r\n", - "ip_list = df['DestinationIP'].astype(str).values.flatten().tolist()\r\n", - "\r\n", - "# Perform bulk lookup on all IPs with specified providers\r\n", - "ti_resp = ti_lookup.lookup_iocs(data=ip_list, providers=[\"AzSTI\", \"XForce\"])\r\n", - "select_ti = browse_results(ti_resp, severities=['high','warning'])\r\n", - "select_ti" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "la_ws_id = widgets.Text(description='Workspace ID:')\n", + "la_ws_key = widgets.Password(description='Workspace Key:')\n", + "display(la_ws_id)\n", + "display(la_ws_key)" + ] }, - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Instantiate data uploader for sentinel\n", + "la_up = LAUploader(workspace=la_ws_id.value, workspace_secret=la_ws_key.value, debug=True)\n", + "# Upload our DataFrame\n", + "la_up.upload_df(data=df, table_name='network_beaconing_anomalies')" + ] }, - "gather": { - "logged": 1632991107856 - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Visualization\r\n", - "MSTICpy also includes a feature to allow you to map locations, this can be particularily useful when looking at the distribution of remote network connections or other events. Below we plot the locations of destination IPs observed in our results." - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Conclusion\n", + "\n", + "We originally started our hunting on very large datasets of firewall logs. Due to the sheer scale of data, we leveraged spark to load the data. \n", + "
We then performed baselining on the historical data and used it to further filter the current-day dataset. In the next step, we performed various data transformations using Spark features such as partitioning, windowing, and ranking to find outbound network beaconing-like behavior.\n",
        "
In order to analyze this data further, we enrich IP entities from result dataset with additional information such as Geolocation, whois registration and threat intel lookups. \n", + "\n", + "Analysts can perform further investigation on selected IP addresses from enrichment results by correlating various data sources available. \n", + "You can then create incidents in Microsoft Sentinel and track investigation in it.\n" + ] } - } - }, - { - "cell_type": "code", - "source": [ - "from msticpy.nbtools import entityschema\r\n", - "from msticpy.sectools.ip_utils import convert_to_ip_entities\r\n", - "from msticpy.nbtools.foliummap import FoliumMap, get_map_center\r\n", - "\r\n", - "# Convert our IP addresses in string format into an ip address entity\r\n", - "ip_entity = entityschema.IpAddress()\r\n", - "ip_list = [convert_to_ip_entities(i)[0] for i in df['DestinationIP'].head(10)]\r\n", - " \r\n", - "# Get center location of all IP locaitons to center the map on\r\n", - "location = get_map_center(ip_list)\r\n", - "logon_map = FoliumMap(location=location, zoom_start=4)\r\n", - "\r\n", - "# Add location markers to our map and dsiplay it\r\n", - "if len(ip_list) > 0:\r\n", - " logon_map.add_ip_cluster(ip_entities=ip_list)\r\n", - "display(logon_map.folium_map)" - ], - "outputs": [], - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": false, - "outputs_hidden": false + ], + "metadata": { + "hide_input": false, + "kernel_info": { + "name": "synapse_pyspark" + }, + "kernelspec": { + "display_name": "Python 3.8 - AzureML", + "language": "python", + "name": "python38-azureml" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.10" + }, + "microsoft": { + "host": { + "AzureML": { + "notebookHasBeenCompleted": true + } + }, + "ms_spell_check": { + "ms_spell_check_language": "en" + } }, "nteract": { - "transient": { - "deleting": false - } - } - } - }, - { - "cell_type": "markdown", - "source": [ - "## Conclusion\r\n", - "\r\n", - "We originally started our hunting on very large datasets of firewall logs. Due to the sheer scale of data, we leveraged spark to load the data. \r\n", - "
We then performed baselining on historical data and use it to further filter current day dataset. In the next step we performed various data transformation by using spark features such as paritioning, windowing, ranking datatset to find outbound network beaconing like behavior.\r\n", - "
In order to analyze this data further, we enrich IP entities from result dataset with additional information such as Geolocation, whois registration and threat intel lookups. \r\n", - "\r\n", - "Analysts can perform further investigation on selected IP addresses from enrichment results by correlating various data sources available. \r\n", - "You can then create incidents in Microsoft Sentinel and track investigation in it.\r\n" - ], - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - } - } - ], - "metadata": { - "kernelspec": { - "name": "python38-azureml", - "language": "python", - "display_name": "Python 3.8 - AzureML" - }, - "language_info": { - "name": "python", - "version": "3.8.1", - "mimetype": "text/x-python", - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "pygments_lexer": "ipython3", - "nbconvert_exporter": "python", - "file_extension": ".py" - }, - "kernel_info": { - "name": "python38-azureml" - }, - "microsoft": { - "host": { - "AzureML": { - "notebookHasBeenCompleted": true + "version": "nteract-front-end@1.0.0" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + "types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false } - } }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 1 } \ No newline at end of file diff --git a/scenario-notebooks/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb b/scenario-notebooks/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb index c88d5248..5c7af3b3 100644 --- a/scenario-notebooks/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb +++ b/scenario-notebooks/Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse.ipynb @@ -1,1474 +1,1197 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse\n", - "\n", - "__Notebook Version:__ 1.0
\n", - "__Python Version:__ Python 3.8 - AzureML
\n", - "__Required Packages:__ azureml-synapse, Msticpy, azure-storage-file-datalake
\n", - "__Platforms Supported:__ Azure Machine Learning Notebooks connected to Azure Synapse Workspace\n", - " \n", - "__Data Source Required:__ Yes\n", - "\n", - "__Data Source:__ CommonSecurityLogs\n", - "\n", - "__Spark Version:__ 3.1 or above\n", - " \n", - "### Description\n", - "In this sample guided scenario notebook, we will demonstrate how to set up continuous data pipeline to store data into azure data lake storage (ADLS) and \n", - "then hunt on that data at scale using distributed processing via Azure Synapse workspace connected to serverless Spark pool. \n", - "Once historical dataset is available in ADLS , we can start performing common hunt operations, create a baseline of normal behavior using PySpark API and also apply data transformations \n", - "to find anomalous behaviors such as periodic network beaconing as explained in the blog - [Detect Network beaconing via Intra-Request time delta patterns in Microsoft Sentinel - Microsoft Tech Community](https://techcommunity.microsoft.com/t5/azure-sentinel/detect-network-beaconing-via-intra-request-time-delta-patterns/ba-p/779586). \n", - "You can use various other spark API to perform other data transformation to understand the data better. \n", - "The output generated can also be further enriched to populate Geolocation information and also visualize using Msticpy capabilities to identify any anomalies. \n", - ".
\n", - "*** Python modules download may be needed. ***
\n", - "*** Please run the cells sequentially to avoid errors. Please do not use \"run all cells\". ***
\n", - "\n", - "## Table of Contents\n", - "1. Warm-up\n", - "2. Authentication to Azure Resources\n", - "3. Configure Azure ML and Azure Synapse Analytics\n", - "4. Load the Historical and current data\n", - "5. Data Wrangling using Spark\n", - "6. Enrich the results\n", - "7. Conclusion\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Warm-up" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "> **Note**: Install below packages only for the first time and restart the kernel once done." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632406406186 + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse\n", + "\n", + "__Notebook Version:__ 1.1
\n", + "__Python Version:__ Python 3.8 - AzureML
\n", + "__Required Packages:__ Msticpy
\n", + "__Platforms Supported:__ Azure Machine Learning Notebooks connected to Azure Synapse Workspace\n", + " \n", + "__Data Source Required:__ Yes\n", + "\n", + "__Data Source:__ CommonSecurityLogs\n", + "\n", + "__Spark Version:__ 3.1 or above\n", + " \n", + "### Description\n", + "In this sample guided scenario notebook, we will demonstrate how to set up continuous data pipeline to store data into azure data lake storage (ADLS) and \n", + "then hunt on that data at scale using distributed processing via Azure Synapse workspace connected to serverless Spark pool. \n", + "Once historical dataset is available in ADLS , we can start performing common hunt operations, create a baseline of normal behavior using PySpark API and also apply data transformations \n", + "to find anomalous behaviors such as periodic network beaconing as explained in the blog - [Detect Network beaconing via Intra-Request time delta patterns in Microsoft Sentinel - Microsoft Tech Community](https://techcommunity.microsoft.com/t5/azure-sentinel/detect-network-beaconing-via-intra-request-time-delta-patterns/ba-p/779586). \n", + "You can use various other spark API to perform other data transformation to understand the data better. \n", + "The output generated can also be further enriched to populate Geolocation information and also visualize using Msticpy capabilities to identify any anomalies. \n", + ".
\n", + "
\n", + "*** Please run the cells sequentially to avoid errors. Please do not use \"run all cells\". ***
\n", + "\n", + "## Table of Contents\n", + "1. Pre-requisites\n", + "2. Data Preparation\n", + "3. Data Wrangling using Spark\n", + "4. Data Enrichment\n", + "5. Data Visualization\n", + "6. Data Ingestion to Sentinel as Custom Table\n", + "7. Conclusion\n", + "\n" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Pre-requisites" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Install AzureML Synapse package to use spark magics\n", - "import sys\n", - "!{sys.executable} -m pip install azureml-synapse" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Set-up Synapse Spark Pool in Azure Machine Learning\n", + "\n", + "This notebook requires Azure ML Spark Compute. If you are using it for the first time, follow the guidelines mentioned here \n", + "[Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-synapse-spark-pool?tabs=studio-ui)\n", + "\n", + "Once you have completed the pre-requisites, you will see AzureML Spark Compute in the dropdown menu for Compute. Select it and run any cell to start Spark Session. \n", + "
Please refer the docs [Managed (Automatic) Spark compute in Azure Machine Learning Notebooks](https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml) for more detailed steps along with screenshots.\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Install Azure storage datalake library to manipulate file systems\n", - "import sys\n", - "!{sys.executable} -m pip install azure-storage-file-datalake --pre" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Installing Libraries\n", + "Also, this notebook requires several libraries to be pre-installed prior to start Spark session. \n", + "
In order to install any libraries, you can use a conda file to configure a Spark session. Please save below file as conda.yml , check the Upload conda file checkbox. Then, select Browse, and choose the conda file saved earlier with the Spark session configuration you want.\n", + "\n", + "```yaml\n", + "name: msticpy\n", + "channels:\n", + "- defaults\n", + "dependencies:\n", + "- bokeh\n", + "- numpy\n", + "- pip:\n", + " - keyring>=13.2.1\n", + " - msticpy[azure]>=2.3.1\n", + "```\n", + "\n", + "### Set-up msticpyconfig.yaml to use with Azure ML with Managed Spark Pool\n", + "\n", + "Once we've done above steps, we need to make sure we have configuration to tell MSTICPy how to connect to your sentinel workspace or any other supported data providers.\n", + "\n", + "This configuration is stored in a configuration file (`msticpyconfig.yaml`).\n", + "\n", + "Learn more...\n", + "Although you don't need to know these details now, you can find more information here:\n", + "\n", + "- [MSTICPy Package Configuration](https://msticpy.readthedocs.io/en/latest/getting_started/msticpyconfig.html)\n", + "- [MSTICPy Settings Editor](https://msticpy.readthedocs.io/en/latest/getting_started/SettingsEditor.html)\n", + "\n", + "If you need a more complete walk-through of configuration, we have a separate notebook to help you:\n", + "- [Configuring Notebook Environment](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb)\n", + "- And for the ultimate walk-through of how to configure all your `msticpyconfig.yaml` settings see the [MPSettingsEditor notebook](https://github.com/microsoft/msticpy/blob/master/docs/notebooks/MPSettingsEditor.ipynb)\n", + "- The Azure-Sentinel-Notebooks GitHub repo also contains an template `msticpyconfig.yaml`, with commented-out sections that may also be helpful in finding your way around the settings if you want to dig into things by hand.\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Install Azure storage datalake library to manipulate file systems\n", - "import sys\n", - "!{sys.executable} -m pip install msticpy" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "*** $\\color{red}{Note:~After~installing~the~packages,~please~restart~the~kernel.}$ ***" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632982861159 + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Set environment variable MSTICPYCONFIG\n", + "Once you have your `msticpyconfig.yaml` ready in your Users directory, set the environment variable with the path to the file. \n", + "\n", + "Default file share is mounted and available to Managed (Automatic) Spark compute. \n", + "
Refer to the docs - [Accessing data on the default file share](https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml#accessing-data-on-the-default-file-share)\n",
        "\n",
        "In the code below, we first generate the absolute path of the current mount directory and append the path where `msticpyconfig.yaml` is stored. \n",
        "
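If you prefer plain Python over the `%env` magic used below, the same environment variable can be set directly. This is an illustrative sketch only; the `Users/<your-alias>/msticpyconfig.yaml` path is a placeholder for your own location on the mounted file share:

```python
import os

# Illustrative only: point MSTICPYCONFIG at your config file on the mounted file share.
# Replace the relative path below with the actual location of your msticpyconfig.yaml.
msticpyconfig_path = os.path.join(os.path.abspath("."), "Users/<your-alias>/msticpyconfig.yaml")
os.environ["MSTICPYCONFIG"] = msticpyconfig_path
print(os.environ["MSTICPYCONFIG"])
```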
Execute the below cell to get the current path and then replace it in the next cell, also replace Username or path to `msticpyconfig.yaml`" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "import os\n", + "from ipywidgets import widgets\n", + "# Provide AML Directory path to msticpyconfig.yaml e.g. /Users/ashwin/msticpyconfig.yaml\n", + "yaml_path = widgets.Text(description='AML Directory Path to msticpyconfig.aml:')\n", + "display(yaml_path)" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Load Python libraries that will be used in this notebook\n", - "from azure.common.client_factory import get_client_from_cli_profile\n", - "from azure.common.credentials import get_azure_cli_credentials\n", - "from azure.mgmt.resource import ResourceManagementClient\n", - "from azureml.core import Workspace, LinkedService, SynapseWorkspaceLinkedServiceConfiguration, Datastore\n", - "from azureml.core.compute import SynapseCompute, ComputeTarget\n", - "from datetime import timedelta, datetime\n", - "from azure.storage.filedatalake import DataLakeServiceClient\n", - "from azure.core._match_conditions import MatchConditions\n", - "from azure.storage.filedatalake._models import ContentSettings\n", - "\n", - "import json\n", - "import os, uuid, sys\n", - "import IPython\n", - "import pandas as pd\n", - "from ipywidgets import widgets, Layout\n", - "from IPython.display import display, HTML\n", - "from pathlib import Path\n", - "\n", - "REQ_PYTHON_VER=(3, 6)\n", - "REQ_MSTICPY_VER=(1, 4, 4)\n", - "\n", - "display(HTML(\"

Starting Notebook setup...

\"))\n", - "\n", - "# If not using Azure Notebooks, install msticpy with\n", - "# %pip install msticpy\n", - "\n", - "from msticpy.nbtools import nbinit\n", - "extra_imports = [\n", - " \"msticpy.nbtools.nbdisplay, draw_alert_entity_graph\",\n", - " \"msticpy.sectools.ip_utils, convert_to_ip_entities\",\n", - " \"msticpy.nbtools.ti_browser, browse_results\",\n", - " \"IPython.display, Image\",\n", - " \"msticpy.sectools.ip_utils, get_whois_info\",\n", - " \"msticpy.sectools.ip_utils, get_ip_type\"\n", - "]\n", - "nbinit.init_notebook(\n", - " namespace=globals(),\n", - " # additional_packages=[\"azureml-synapse\", \"azure-cli\", \"azure-storage-file-datalake\"],\n", - " extra_imports=extra_imports,\n", - ");\n", - "\n", - "\n", - "WIDGET_DEFAULTS = {\n", - " \"layout\": Layout(width=\"95%\"),\n", - " \"style\": {\"description_width\": \"initial\"},\n", - "}\n", - "\n", - "#Set pandas options\n", - "pd.get_option('max_rows',10)\n", - "pd.set_option('max_colwidth',50)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Configure Azure ML and Azure Synapse Analytics\n", - "\n", - "Please use notebook [Configurate Azure ML and Azure Synapse Analytics](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/Configurate%20Azure%20ML%20and%20Azure%20Synapse%20Analytics.ipynb) to configure environment. \n", - "\n", - "The notebook will configure existing Azure synapse workspace to create and connect to Spark pool. You can then create linked service and connect AML workspace to Azure Synapse workspaces.\n", - "
It will also configure data export rules to export data from Log analytics workspace CommonSecurityLog table to Azure Data lake storage Gen 2.\n", - "\n", - "> **Note**: Specify the input parameters in below step in order to connect AML workspace to synapse workspace using linked service." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1678913457350 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Retrieve AML notebook mount path\n", + "abs_path = os.path.abspath('.')\n", + "msticpyconfig_path = f\"{abs_path}{yaml_path.value}\"\n", + "print(msticpyconfig_path)" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "amlworkspace = \"\" # fill in your AML workspace name\n", - "subscription_id = \"\" # fill in your subscription id\n", - "resource_group = '' # fill in your resource groups for AML workspace\n", - "linkedservice = '' # fill in your linked service created to connect to synapse workspace" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Authentication to Azure Resources\n", - "\n", - "In this step we will connect aml workspace to linked service connected to Azure Synapse workspace" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632982882257 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "%env MSTICPYCONFIG = {msticpyconfig}" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Load Python libraries that will be used in this notebook\n", + "import json\n", + "import os, uuid, sys\n", + "import IPython\n", + "import pandas as pd\n", + "from ipywidgets import widgets, Layout\n", + "from IPython.display import display, HTML\n", + "from pathlib import Path\n", + "from datetime import timedelta, datetime\n", + "\n", + "import msticpy as mp\n", + "mp.refresh_config()\n", + "\n", + "from msticpy.context.geoip import GeoLiteLookup\n", + "from msticpy.context.tilookup import TIl\n", + "from msticpy.nbtools import entityschema\n", + "from msticpy.sectools.ip_utils import convert_to_ip_entities\n", + "from msticpy.nbtools.foliummap import FoliumMap, get_map_center\n", + "from msticpy.data.uploaders.loganalytics_uploader import LAUploader\n", + "mp.check_version()\n", + "# mp.init_notebook(globals())\n", + "\n", + "REQ_PYTHON_VER=(3, 6)\n", + "REQ_MSTICPY_VER=(1, 4, 4)\n", + "\n", + "display(HTML(\"

Starting Notebook setup...

\"))\n", + "\n", + "# If not using Azure Notebooks, install msticpy with\n", + "# %pip install msticpy\n", + "\n", + "WIDGET_DEFAULTS = {\n", + " \"layout\": Layout(width=\"95%\"),\n", + " \"style\": {\"description_width\": \"initial\"},\n", + "}\n", + "\n", + "#Set pandas options\n", + "pd.set_option('display.max_rows', 500)\n", + "pd.set_option('display.max_colwidth', 500)" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Get the aml workspace\n", - "aml_workspace = Workspace.get(name=amlworkspace, subscription_id=subscription_id, resource_group=resource_group)\n", - "\n", - "# Retrieve a known linked service\n", - "linked_service = LinkedService.get(aml_workspace, linkedservice)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Start Spark Session\n", - "Enter your Synapse Spark compute below. To find the Spark compute, please follow these steps:
\n", - "1. On the AML Studio left menu, navigate to **Linked Services**
\n", - "2. Click on the name of the Link Service you want to use
\n", - "3. Select **Spark pools** tab
\n", - "4. Get the Name of the Spark pool you want to use.
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Preparation\n", + "In this step, we will define several details associated with ADLS account and specify input date and lookback period to calculate baseline. Based on the input dates and lookback period , we will load the data.\n", + "\n", + "\n" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "synapse_spark_compute = input('Synapse Spark compute:')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632959931295 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Primary storage info\n", + "account_name = '' # fill in your primary account name\n", + "container_name = '' # fill in your container name\n", + "subscription_id = '' # fill in your subscription id\n", + "resource_group = '' # fill in your resource groups for ADLS\n", + "workspace_name = '' # fill in your workspace name\n", + "device_vendor = \"Fortinet\" # Replace your desired network vendor from commonsecuritylogs\n", + "\n", + "# Datetime and lookback parameters\n", + "end_date = \"\" # fill in your input date\n", + "lookback_days = 21 # fill in lookback days if you want to run it on historical data. make sure you have historical data available in ADLS" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632245154374 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "from pyspark.sql.types import *\n", + "from pyspark.sql.window import Window\n", + "from pyspark.sql.functions import lag, col\n", + "from pyspark.sql.functions import *\n", + "from pyspark.sql import functions as F\n", + "from datetime import timedelta, datetime, date\n", + "\n", + "# Compiling ADLS paths from date string\n", + "end_date_str = end_date.split(\"-\")\n", + "current_path = f\"/y={end_date_str[0]}/m={end_date_str[1]}/d={end_date_str[2]}\"\n", + "\n", + "def generate_adls_paths(end_date, lookback_days, adls_path):\n", + " endDate = datetime.strptime(end_date, '%Y-%m-%d')\n", + " endDate = endDate - timedelta(days=1)\n", + " startDate = endDate - timedelta(days=lookback_days)\n", + " daterange = [startDate + timedelta(days=x) for x in range((endDate-startDate).days + 1)]\n", + "\n", + " pathlist = []\n", + " for day in daterange:\n", + " date_str = day.strftime('%Y-%m-%d').split(\"-\")\n", + " day_path = adls_path + f\"/y={date_str[0]}/m={date_str[1]}/d={date_str[2]}\"\n", + " pathlist.append(day_path)\n", + "\n", + " return pathlist\n", + "\n", + "adls_path = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/WorkspaceResourceId=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/providers/microsoft.operationalinsights/workspaces/{workspace_name}'\n", + "current_day_path = adls_path + current_path\n", + "historical_path = generate_adls_paths(end_date, lookback_days, adls_path)\n" + ] }, - "nteract": { - 
"transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Start spark session\n", - "%synapse start -s $subscription_id -w $amlworkspace -r $resource_group -c $synapse_spark_compute" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Data Preparation\n", - "In this step, we will define several details associated with ADLS account and specify input date and lookback period to calculate baseline. Based on the input dates and lookback period , we will load the data.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Load Current day\n", + "\n", + "In this step, you will load the data based on the input date specified." + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "\n", - "# Primary storage info\n", - "account_name = '' # fill in your primary account name\n", - "container_name = '' # fill in your container name\n", - "subscription_id = '' # fill in your subscription id\n", - "resource_group = '' # fill in your resource groups for ADLS\n", - "workspace_name = '' # fill in your workspace name\n", - "device_vendor = \"Fortinet\" # Replace your desired network vendor from commonsecuritylogs\n", - "\n", - "# Datetime and lookback parameters\n", - "end_date = \"\" # fill in your input date\n", - "lookback_days = 21 # fill in lookback days if you want to run it on historical data. make sure you have historical data available in ADLS" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632245154374 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "try:\n", + " current_df = (\n", + " spark.read.option(\"recursiveFileLook\", \"true\")\n", + " .option(\"header\", \"true\")\n", + " .json(current_day_path)\n", + " )\n", + "\n", + " current_df = ( \n", + " current_df\n", + " .select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " )\n", + " .filter(F.col(\"DeviceVendor\") == device_vendor)\n", + " )\n", + "except Exception as e:\n", + " print(f\"Could note load the data due to error:{e}\")\n", + "\n", + "#Display the count of records\n", + "print(f\"No of records loaded from the current date specified: {current_df.count()}\")" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Load Historical data\n", + "\n", + "You can also perform the analysis on all historical data available in your ADLS account. The notebook is currently configured to run only on current date specified in input. \n", + "\n", + "If you need to perform the same analysis on historical data, run the cell below and under Data Wrangling using Spark -> Filtering Data code cell, replace `current_df` with `historical_df` variable. 
\n", + "
**Otherwise SKIP running below cell as it will result in an error if you do not have historical data**" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "from pyspark.sql.types import *\n", - "from pyspark.sql.window import Window\n", - "from pyspark.sql.functions import lag, col\n", - "from pyspark.sql.functions import *\n", - "from pyspark.sql import functions as F\n", - "from datetime import timedelta, datetime, date\n", - "\n", - "# Compiling ADLS paths from date string\n", - "end_date_str = end_date.split(\"-\")\n", - "current_path = f\"/y={end_date_str[0]}/m={end_date_str[1]}/d={end_date_str[2]}\"\n", - "\n", - "def generate_adls_paths(end_date, lookback_days, adls_path):\n", - " endDate = datetime.strptime(end_date, '%Y-%m-%d')\n", - " endDate = endDate - timedelta(days=1)\n", - " startDate = endDate - timedelta(days=lookback_days)\n", - " daterange = [startDate + timedelta(days=x) for x in range((endDate-startDate).days + 1)]\n", - "\n", - " pathlist = []\n", - " for day in daterange:\n", - " date_str = day.strftime('%Y-%m-%d').split(\"-\")\n", - " day_path = adls_path + f\"/y={date_str[0]}/m={date_str[1]}/d={date_str[2]}\"\n", - " pathlist.append(day_path)\n", - "\n", - " return pathlist\n", - "\n", - "adls_path = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/WorkspaceResourceId=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/providers/microsoft.operationalinsights/workspaces/{workspace_name}'\n", - "current_day_path = adls_path + current_path\n", - "historical_path = generate_adls_paths(end_date, lookback_days, adls_path)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Load Current day\n", - "\n", - "In this step, you will load the data based on the input date specified." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "try:\n", + " #Read Previous days data\n", + " historical_df = (\n", + " spark.read.option(\"recursiveFileLook\", \"true\")\n", + " .option(\"header\", \"true\")\n", + " .json(historical_path[-1])\n", + " )\n", + " historical_df = historical_df.select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " ).filter(F.col(\"DeviceVendor\") == device_vendor)\n", + "\n", + " #Read all historical days data per day and union them together\n", + " for path in historical_path[:-1]:\n", + " daily_table = (\n", + " spark.read.option(\"recursiveFileLook\", \"true\")\n", + " .option(\"header\", \"true\")\n", + " .json(path)\n", + " )\n", + " daily_table = daily_table.select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " ).filter(F.col(\"DeviceVendor\") == device_vendor)\n", + " historical_df = historical_df.union(daily_table)\n", + "\n", + "except Exception as e:\n", + " print(f\"Could not load the data due to error:\\n {e}\")\n", + " \n", + "#Display the count of records\n", + "print(f\"\\n No of records loaded from the lookback days specified: {historical_df.count()}\")" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "\n", - "try:\n", - " current_df = (\n", - " spark.read.option(\"recursiveFileLook\", \"true\")\n", - " .option(\"header\", \"true\")\n", - " .json(current_day_path)\n", - " )\n", - "\n", - " current_df = ( \n", - " current_df\n", - " .select(\n", - " \"TimeGenerated\",\n", - " \"SourceIP\",\n", - " \"SourcePort\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " \"Protocol\",\n", - " \"ReceivedBytes\",\n", - " \"SentBytes\",\n", - " \"DeviceVendor\",\n", - " )\n", - " .filter(F.col(\"DeviceVendor\") == device_vendor)\n", - " )\n", - "except Exception as e:\n", - " print(f\"Could note load the data due to error:{e}\")\n", - "\n", - "#Display the count of records\n", - "print(f\"No of records loaded from the current date specified: {current_df.count()}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Load Historical data\n", - "\n", - "You can also perform the analysis on all historical data available in your ADLS account. The notebook is currently configured to run only on current date specified in input. \n", - "\n", - "If you need to perform the same analysis on historical data, run the cell below and under Data Wrangling using Spark -> Filtering Data code cell, replace `current_df` with `historical_df` variable. \n", - "
**Otherwise SKIP running below cell as it will result in an error if you do not have historical data**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Wrangling using Spark" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "\n", - "try:\n", - " #Read Previous days data\n", - " historical_df = (\n", - " spark.read.option(\"recursiveFileLook\", \"true\")\n", - " .option(\"header\", \"true\")\n", - " .json(historical_path[-1])\n", - " )\n", - " historical_df = historical_df.select(\n", - " \"TimeGenerated\",\n", - " \"SourceIP\",\n", - " \"SourcePort\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " \"Protocol\",\n", - " \"ReceivedBytes\",\n", - " \"SentBytes\",\n", - " \"DeviceVendor\",\n", - " ).filter(F.col(\"DeviceVendor\") == device_vendor)\n", - "\n", - " #Read all historical days data per day and union them together\n", - " for path in historical_path[:-1]:\n", - " daily_table = (\n", - " spark.read.option(\"recursiveFileLook\", \"true\")\n", - " .option(\"header\", \"true\")\n", - " .json(path)\n", - " )\n", - " daily_table = daily_table.select(\n", - " \"TimeGenerated\",\n", - " \"SourceIP\",\n", - " \"SourcePort\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " \"Protocol\",\n", - " \"ReceivedBytes\",\n", - " \"SentBytes\",\n", - " \"DeviceVendor\",\n", - " ).filter(F.col(\"DeviceVendor\") == device_vendor)\n", - " historical_df = historical_df.union(daily_table)\n", - "\n", - "except Exception as e:\n", - " print(f\"Could not load the data due to error:\\n {e}\")\n", - " \n", - "#Display the count of records\n", - "print(f\"\\n No of records loaded from the lookback days specified: {historical_df.count()}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Data Wrangling using Spark" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Filtering data\n", - "In this step, we will prepare dataset by filtering logs to only destination as Public/external IPs. For this, we are using regex and rlike spark API to filter internal to external network traffic." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Filtering data\n", + "In this step, we will prepare dataset by filtering logs to only destination as Public/external IPs. For this, we are using regex and rlike spark API to filter internal to external network traffic." 
+ ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "\n", - "PrivateIPregex = (\"^127\\.|^10\\.|^172\\.1[6-9]\\.|^172\\.2[0-9]\\.|^172\\.3[0-1]\\.|^192\\.168\\.\")\n", - "cooked_df = (current_df # replace historical_df if you want to use historical data\n", - " # use below filter if you have Palo Alto logs\n", - " # .filter(\n", - " # (F.col(\"Activity\") == \"TRAFFIC\")\n", - " # )\n", - " .withColumn(\n", - " \"DestinationIsPrivate\", F.col(\"DestinationIP\").rlike(PrivateIPregex)\n", - " )\n", - " .filter(F.col(\"DestinationIsPrivate\") == \"false\")\n", - " .withColumn(\"TimeGenerated\", F.col(\"TimeGenerated\").cast(\"timestamp\"))\n", - " )\n", - "\n", - "cooked_df.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Baseline data to filter known Source IP and Destination IPs\n", - "\n", - "In this step, you can either analyze Historical data or current data to filter source IP and destination IP per defined criteria. \n", - "\n", - "In below example, we are filtering the Source IP which has daily event count more than the specified threshold.\n", - "
Also, you can filter the destination IPs whom very less source IPs are connecting. This will reduce false positives be filtering destination IPs which are commonly seen from internal systems which are likely benign." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "PrivateIPregex = (\"^127\\.|^10\\.|^172\\.1[6-9]\\.|^172\\.2[0-9]\\.|^172\\.3[0-1]\\.|^192\\.168\\.\")\n", + "cooked_df = (current_df # replace historical_df if you want to use historical data\n", + " # use below filter if you have Palo Alto logs\n", + " # .filter(\n", + " # (F.col(\"Activity\") == \"TRAFFIC\")\n", + " # )\n", + " .withColumn(\n", + " \"DestinationIsPrivate\", F.col(\"DestinationIP\").rlike(PrivateIPregex)\n", + " )\n", + " .filter(F.col(\"DestinationIsPrivate\") == \"false\")\n", + " .withColumn(\"TimeGenerated\", F.col(\"TimeGenerated\").cast(\"timestamp\"))\n", + " )\n", + "\n", + "cooked_df.show()" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "daily_event_count_threshold = 100 # Replace the threshold based on your environment or use default values \n", - "degree_of_srcip_threshold = 25 # Replace the threshold based on your environment or use default values \n", - "\n", - "# Filtering IP list per TotalEventsThreshold\n", - "csl_srcip = (\n", - " cooked_df.groupBy(\"SourceIP\")\n", - " .count()\n", - " .filter(F.col(\"count\") > daily_event_count_threshold)\n", - " .orderBy(F.col(\"count\"), ascending=False)\n", - " )\n", - "\n", - "# Filtering Destination IP list per Degree of Source IPs threshold\n", - "csl_dstip = (\n", - " cooked_df.groupBy(\"DestinationIP\")\n", - " .agg(F.countDistinct(\"SourceIP\").alias(\"DegreeofSourceIps\"))\n", - " .filter(F.col(\"DegreeofSourceIps\") < degree_of_srcip_threshold)\n", - " )\n", - "\n", - "# Filtering IP list per Daily event threshold\n", - "baseline_df = (\n", - " cooked_df.join(csl_srcip, [\"SourceIP\"])\n", - " .join(csl_dstip, [\"DestinationIP\"])\n", - " .select(\n", - " \"TimeGenerated\",\n", - " \"SourceIP\",\n", - " \"SourcePort\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " \"Protocol\",\n", - " \"ReceivedBytes\",\n", - " \"SentBytes\",\n", - " \"DeviceVendor\",\n", - " )\n", - " )\n", - "\n", - "baseline_df.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Rank the datasets and Calculate PercentageBeaconing\n", - "\n", - "In this step, we will use spark to wrangle the data by applying below transformations.\n", - "- Sort the dataset per SourceIP\n", - "- Calculate the time difference between First event and next event.\n", - "- Partition dataset per Source IP, Destination IP, Destination Port\n", - "- Window dataset into consecutive 3 to Calculate the Timedeltalistcount based on cluster of 1-3 timedelta events.\n", - "- Calculate percentagebeacon between TotalEventscount and Timedeltalistcount\n", - "- Apply thresholds to further reduce false positives\n", - "\n", - "** SPARK References:**\n", - "- https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html\n", - "- 
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#window" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Baseline data to filter known Source IP and Destination IPs\n", + "\n", + "In this step, you can either analyze Historical data or current data to filter source IP and destination IP per defined criteria. \n", + "\n", + "In below example, we are filtering the Source IP which has daily event count more than the specified threshold.\n", + "
Also, you can filter the destination IPs whom very less source IPs are connecting. This will reduce false positives be filtering destination IPs which are commonly seen from internal systems which are likely benign." + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "time_delta_threshold = 15 # Replace thresholds in seconds interval between 2 successive events. min 10 to anything maximum.\n", - "percent_beacon_threshold = 75 # Replace thresholds in percentage. Max value can be 100.\n", - "\n", - "# Serialize the dataset by sorting timegenerated and partition by SourceIP and WorkspaceId\n", - "w = (\n", - " Window()\n", - " .partitionBy(F.col(\"SourceIP\"))\n", - " .orderBy(F.col(\"TimeGenerated\"))\n", - " )\n", - "\n", - "# Calculate new timestamp column of next event\n", - "csl_beacon_df = baseline_df.select(\n", - " \"*\", lag(\"TimeGenerated\").over(w).alias(\"prev_TimeStamp\")\n", - " ).na.drop()\n", - "\n", - "# Calculate timedelta difference between previoud and next timestamp\n", - "timeDiff = F.unix_timestamp(\"TimeGenerated\") - F.unix_timestamp(\"prev_TimeStamp\")\n", - "\n", - "# Add new column as timedelta\n", - "csl_beacon_df = csl_beacon_df.withColumn(\"Timedelta\", timeDiff).filter(\n", - " F.col(\"Timedelta\") > time_delta_threshold\n", - " )\n", - "\n", - "csl_beacon_ranked = csl_beacon_df.groupBy(\n", - " \"DeviceVendor\",\n", - " \"SourceIP\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " \"Protocol\",\n", - " \"Timedelta\",\n", - " ).agg(\n", - " F.count(\"Timedelta\").alias(\"Timedeltacount\"),\n", - " F.sum(\"SentBytes\").alias(\"TotalSentBytes\"),\n", - " F.sum(\"ReceivedBytes\").alias(\"TotalReceivedBytes\"),\n", - " F.count(\"*\").alias(\"Totalevents\"),\n", - " )\n", - "\n", - "w1 = (\n", - " Window.partitionBy(\n", - " \"DeviceVendor\",\n", - " \"SourceIP\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " )\n", - " .orderBy(F.col(\"SourceIP\").cast(\"long\"))\n", - " .rowsBetween(-2, 0)\n", - " )\n", - "\n", - "csl_beacon_df = (\n", - " csl_beacon_ranked\n", - " .join(csl_dstip, [\"DestinationIP\"])\n", - " .withColumn(\"Timedeltalist\", F.collect_list(\"Timedeltacount\").over(w1))\n", - " .withColumn(\n", - " \"Timedeltalistcount\",\n", - " F.expr(\"AGGREGATE(Timedeltalist, DOUBLE(0), (acc, x) -> acc + x)\").cast(\n", - " \"long\"\n", - " ),\n", - " )\n", - " .withColumn(\n", - " \"Totaleventcount\",\n", - " F.sum(\"Timedeltalistcount\").over(\n", - " Window.partitionBy(\"SourceIP\", \"DestinationIP\", \"DestinationPort\")\n", - " ),\n", - " )\n", - " .withColumn(\n", - " \"Percentbeacon\",\n", - " (\n", - " F.col(\"Timedeltalistcount\")\n", - " / F.sum(\"Timedeltalistcount\").over(\n", - " Window.partitionBy(\n", - " \"DeviceVendor\",\n", - " \"SourceIP\",\n", - " \"DestinationIP\",\n", - " \"DestinationPort\",\n", - " )\n", - " )\n", - " * 100.0\n", - " ),\n", - " )\n", - " .filter(F.col(\"Percentbeacon\") > percent_beacon_threshold)\n", - " .filter(F.col(\"Totaleventcount\") > daily_event_count_threshold)\n", - " .orderBy(F.col(\"Percentbeacon\"), ascending=False)\n", - " )\n", - "\n", - "\n", - "csl_beacon_df.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Export results from ADLS\n", - "In this step, we will save the results from previous step as single json file in ADLS. 
This file can be exported from ADLS to be used with native python session outside spark pool for more data analysis, visualization etc." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "daily_event_count_threshold = 100 # Replace the threshold based on your environment or use default values \n", + "degree_of_srcip_threshold = 25 # Replace the threshold based on your environment or use default values \n", + "\n", + "# Filtering IP list per TotalEventsThreshold\n", + "csl_srcip = (\n", + " cooked_df.groupBy(\"SourceIP\")\n", + " .count()\n", + " .filter(F.col(\"count\") > daily_event_count_threshold)\n", + " .orderBy(F.col(\"count\"), ascending=False)\n", + " )\n", + "\n", + "# Filtering Destination IP list per Degree of Source IPs threshold\n", + "csl_dstip = (\n", + " cooked_df.groupBy(\"DestinationIP\")\n", + " .agg(F.countDistinct(\"SourceIP\").alias(\"DegreeofSourceIps\"))\n", + " .filter(F.col(\"DegreeofSourceIps\") < degree_of_srcip_threshold)\n", + " )\n", + "\n", + "# Filtering IP list per Daily event threshold\n", + "baseline_df = (\n", + " cooked_df.join(csl_srcip, [\"SourceIP\"])\n", + " .join(csl_dstip, [\"DestinationIP\"])\n", + " .select(\n", + " \"TimeGenerated\",\n", + " \"SourceIP\",\n", + " \"SourcePort\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"ReceivedBytes\",\n", + " \"SentBytes\",\n", + " \"DeviceVendor\",\n", + " )\n", + " )\n", + "\n", + "baseline_df.show()" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%%synapse\n", - "dir_name = \"\" # specify desired directory name\n", - "new_path = adls_path + dir_name\n", - "csl_beacon_pd = csl_beacon_df.coalesce(1).write.format(\"json\").save(new_path)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Stop Spark Session" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Rank the datasets and Calculate PercentageBeaconing\n", + "\n", + "In this step, we will use spark to wrangle the data by applying below transformations.\n", + "- Sort the dataset per SourceIP\n", + "- Calculate the time difference between First event and next event.\n", + "- Partition dataset per Source IP, Destination IP, Destination Port\n", + "- Window dataset into consecutive 3 to Calculate the Timedeltalistcount based on cluster of 1-3 timedelta events.\n", + "- Calculate percentagebeacon between TotalEventscount and Timedeltalistcount\n", + "- Apply thresholds to further reduce false positives\n", + "\n", + "** SPARK References:**\n", + "- https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html\n", + "- https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#window" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "%synapse stop" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - 
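If re-running the export fails because the output folder already exists, or the result set is large, a variation of the save step with an explicit save mode can help. This is a sketch under the assumption that `csl_beacon_df` and `adls_path` are defined as in the cells above; `beacon_results` is just a placeholder directory name.

```python
# Sketch: write the results with an explicit save mode so re-runs do not fail.
# Assumes csl_beacon_df and adls_path exist; "beacon_results" is a placeholder.
dir_name = "beacon_results"

(csl_beacon_df
    .coalesce(1)            # single output file for easy download; drop for large results
    .write
    .mode("overwrite")      # avoid "path already exists" errors on re-runs
    .format("json")         # "parquet" is a smaller, schema-preserving alternative
    .save(adls_path + dir_name))
```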
"transient": { - "deleting": false - } - } - }, - "source": [ - "## Export results from ADLS to local filesystem\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632987153746 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "time_delta_threshold = 15 # Replace thresholds in seconds interval between 2 successive events. min 10 to anything maximum.\n", + "percent_beacon_threshold = 75 # Replace thresholds in percentage. Max value can be 100.\n", + "\n", + "# Serialize the dataset by sorting timegenerated and partition by SourceIP and WorkspaceId\n", + "w = (\n", + " Window()\n", + " .partitionBy(F.col(\"SourceIP\"))\n", + " .orderBy(F.col(\"TimeGenerated\"))\n", + " )\n", + "\n", + "# Calculate new timestamp column of next event\n", + "csl_beacon_df = baseline_df.select(\n", + " \"*\", lag(\"TimeGenerated\").over(w).alias(\"prev_TimeStamp\")\n", + " ).na.drop()\n", + "\n", + "# Calculate timedelta difference between previoud and next timestamp\n", + "timeDiff = F.unix_timestamp(\"TimeGenerated\") - F.unix_timestamp(\"prev_TimeStamp\")\n", + "\n", + "# Add new column as timedelta\n", + "csl_beacon_df = csl_beacon_df.withColumn(\"Timedelta\", timeDiff).filter(\n", + " F.col(\"Timedelta\") > time_delta_threshold\n", + " )\n", + "\n", + "csl_beacon_ranked = csl_beacon_df.groupBy(\n", + " \"DeviceVendor\",\n", + " \"SourceIP\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " \"Protocol\",\n", + " \"Timedelta\",\n", + " ).agg(\n", + " F.count(\"Timedelta\").alias(\"Timedeltacount\"),\n", + " F.sum(\"SentBytes\").alias(\"TotalSentBytes\"),\n", + " F.sum(\"ReceivedBytes\").alias(\"TotalReceivedBytes\"),\n", + " F.count(\"*\").alias(\"Totalevents\"),\n", + " )\n", + "\n", + "w1 = (\n", + " Window.partitionBy(\n", + " \"DeviceVendor\",\n", + " \"SourceIP\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " )\n", + " .orderBy(F.col(\"SourceIP\").cast(\"long\"))\n", + " .rowsBetween(-2, 0)\n", + " )\n", + "\n", + "csl_beacon_df = (\n", + " csl_beacon_ranked\n", + " .join(csl_dstip, [\"DestinationIP\"])\n", + " .withColumn(\"Timedeltalist\", F.collect_list(\"Timedeltacount\").over(w1))\n", + " .withColumn(\n", + " \"Timedeltalistcount\",\n", + " F.expr(\"AGGREGATE(Timedeltalist, DOUBLE(0), (acc, x) -> acc + x)\").cast(\n", + " \"long\"\n", + " ),\n", + " )\n", + " .withColumn(\n", + " \"Totaleventcount\",\n", + " F.sum(\"Timedeltalistcount\").over(\n", + " Window.partitionBy(\"SourceIP\", \"DestinationIP\", \"DestinationPort\")\n", + " ),\n", + " )\n", + " .withColumn(\n", + " \"Percentbeacon\",\n", + " (\n", + " F.col(\"Timedeltalistcount\")\n", + " / F.sum(\"Timedeltalistcount\").over(\n", + " Window.partitionBy(\n", + " \"DeviceVendor\",\n", + " \"SourceIP\",\n", + " \"DestinationIP\",\n", + " \"DestinationPort\",\n", + " )\n", + " )\n", + " * 100.0\n", + " ),\n", + " )\n", + " .filter(F.col(\"Percentbeacon\") > percent_beacon_threshold)\n", + " .filter(F.col(\"Totaleventcount\") > daily_event_count_threshold)\n", + " .orderBy(F.col(\"Percentbeacon\"), ascending=False)\n", + " )\n", + "\n", + "\n", + "csl_beacon_df.show()" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### 
Display results" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "def initialize_storage_account(storage_account_name, storage_account_key):\n", - "\n", - " try:\n", - " global service_client\n", - "\n", - " service_client = DataLakeServiceClient(\n", - " account_url=\"{}://{}.dfs.core.windows.net\".format(\n", - " \"https\", storage_account_name\n", - " ),\n", - " credential=storage_account_key,\n", - " )\n", - "\n", - " except Exception as e:\n", - " print(e)\n", - "\n", - "\n", - "def list_directory_contents(container_name, input_path, file_type):\n", - " try:\n", - " file_system_client = service_client.get_file_system_client(\n", - " file_system=container_name\n", - " )\n", - "\n", - " paths = file_system_client.get_paths(path=input_path)\n", - "\n", - " pathlist = []\n", - "\n", - " for path in paths:\n", - " pathlist.append(path.name) if path.name.endswith(file_type) else pathlist\n", - "\n", - " return pathlist\n", - "\n", - " except Exception as e:\n", - " print(e)\n", - "\n", - "\n", - "def download_file_from_directory(container_name, input_path, input_file):\n", - " try:\n", - " file_system_client = service_client.get_file_system_client(\n", - " file_system=container_name\n", - " )\n", - "\n", - " directory_client = file_system_client.get_directory_client(input_path)\n", - "\n", - " local_file = open(\"output.json\", \"wb\")\n", - "\n", - " file_client = directory_client.get_file_client(input_file)\n", - "\n", - " download = file_client.download_file()\n", - "\n", - " downloaded_bytes = download.readall()\n", - "\n", - " local_file.write(downloaded_bytes)\n", - "\n", - " local_file.close()\n", - "\n", - " except Exception as e:\n", - " print(e)\n", - "\n", - "\n", - "def json_normalize(input_file, output_file):\n", - " nwbeaconList = []\n", - " with open(input_file) as f:\n", - " for jsonObj in f:\n", - " nwbeaconDict = json.loads(jsonObj)\n", - " nwbeaconList.append(nwbeaconDict)\n", - "\n", - " with open(output_file, \"w\") as write_file:\n", - " json.dump(nwbeaconList, write_file)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Download the files from ADLS\n", - "\n", - "In below sections, we will provide input details about ADLS account ad then use functions to connect , list contents and download results locally.\n", - "\n", - "If you need help in locating input details, follow below steps\n", - "- Go to the https://web.azuresynapse.net and sign in to your workspace.\n", - "- In Synapse Studio, click Data, select the Linked tab, and select the container under Azure Data Lake Storage Gen2.\n", - "- Navigate to folder from the container, right click and select Properies.\n", - "- Copy ABFSS path , extact the details and map to the input fields\n", - "\n", - "\n", - "You can check [View account access keys](https://docs.microsoft.com/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#view-account-access-keys) doc to find and retrieve your storage account keys for ADLS account.\n", - "\n", - "

\n", - "Warning: If you are storing secrets such as storage account keys in the notebook you should
\n", - "probably opt to store either into msticpyconfig file on the compute instance or use\n", - "Azure Key Vault to store the secrets.
\n", - "Read more about using KeyVault\n", - "in the MSTICPY docs\n", - "

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1633376635135 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632987687222 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Convert spark dataframe to pandas dataframe\n", + "df = csl_beacon_df.toPandas()\n", + "# Display results\n", + "df.head()" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Enrichment\n", + "In this section, we will enrich entities retrieved from network beaconing behavior such as IP information.\n", + "Types of Enrichment which will beneficial in perfoming investigation will be IP Geolcation , Whois Registrar information and ThreatIntel lookups.\n", + "\n", + "For first time users, please refer `Getting Started Guide For Microsoft Sentinel ML Notebooks` and section [Create your configuration file](https://docs.microsoft.com/azure/sentinel/notebook-get-started#create-your-configuration-file) to create your `mstipyconfig`. " + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Primary storage info\n", - "account_name = \"\" # fill in your primary account name\n", - "container_name = \"\" # fill in your container name\n", - "subscription_id = \"\"\n", - "resource_group = \"\" # fill in your resource gropup for ADLS account\n", - "workspace_name = \"\" # fill in your la workspace name\n", - "input_path = f\"WorkspaceResourceId=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/providers/microsoft.operationalinsights/workspaces/\"\n", - "\n", - "adls_path = f\"abfss://{container_name}@{account_name}.dfs.core.windows.net/{input_path}/{workspace_name}\"\n", - "\n", - "dir_name = \"/\" #Replace the dirname previously specified to store results from spark\n", - "account_key = \"\" # Replace your storage account key" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632987683046 + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### IP Geolocation Enrichment\n", + "In this step, we will use msticpy geolocation capabilities using maxmind database. You will need maxmind API key to download the database.\n", + "\n", + "
Note: You may see the GeoLite driver downloading its database the first time you run this.\n",
+     "\n",
+     "Learn more about MSTICPy GeoIP providers in the MSTICPy documentation (GeoIP Lookups).\n",
+     "
" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632990019315 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "iplocation = GeoLiteLookup()\n", + "\n", + "df = iplocation.df_lookup_ip(df, column=\"DestinationIP\")\n", + "df.head()" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "new_path = input_path + dir_name\n", - "\n", - "initialize_storage_account(account_name, account_key)\n", - "pathlist = list_directory_contents(container_name, new_path, \"json\")\n", - "input_file = pathlist[0].split(\"/\")[-1]\n", - "download_file_from_directory(container_name, new_path, input_file)\n", - "\n", - "json_normalize(\"output.json\", \"out_normalized.json\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Display results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632987687222 + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### Whois registration enrichment\n", + "In this step, we can perform whois lokup on all public destination ips and populate additional information such as ASN. You can use this output to further filter known ASNs from the results." + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632989270369 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "num_ips = len(df[\"DestinationIP\"].unique())\n", + "print(f\"Performing WhoIs lookups for {num_ips} IPs \", end=\"\")\n", + "df[\"DestASN\"] = df.apply(lambda x: get_whois_info(x.DestinationIP, True), axis=1)\n", + "df[\"DestASNFull\"] = df.apply(lambda x: x.DestASN[1], axis=1)\n", + "df[\"DestASN\"] = df.apply(lambda x: x.DestASN[0], axis=1)\n", + "\n", + "#Display results\n", + "df.head()" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "df = pd.read_json('out_normalized.json')\n", - "\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Enrich results \n", - "In this section, we will enrich entities retrieved from network beaconing behavior such as IP information.\n", - "Types of Enrichment which will beneficial in perfoming investigation will be IP Geolcation , Whois Registrar information and ThreatIntel lookups.\n", - "\n", - "For first time users, please refer `Getting Started Guide For Microsoft Sentinel ML Notebooks` and section [Create your configuration file](https://docs.microsoft.com/azure/sentinel/notebook-get-started#create-your-configuration-file) to create your `mstipyconfig`. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### IP Geolocation Enrichment\n", - "In this step, we will use msticpy geolocation capabilities using maxmind database. 
You will need maxmind API key to download the database.\n", - "\n", - "
Note:\n", - "You may see the GeoLite driver downloading its database the first time you run this.\n", - "
\n", - "
\n", - "
\n", - " Learn more about MSTICPy GeoIP providers...\n", - "

\n", - " MSTICPy GeoIP Providers\n", - "

\n", - "
\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632990019315 + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "### ThreatIntel Enrichment\n", + "\n", + "In this step, we can perform threatintel lookup using msticpy and open source TI providers such as IBM Xforce, VirusTotal, Greynoise etc. \n", + "Below example shows performing lookup on single IP as well as bulk lookup on all ips using IBM Xforce TI Provider. \n", + "
You will need to register with IBM XForce and enter the API keys into `msticpyconfig.yaml`.\n",
+     "\n",
+     "Learn more:\n",
+     "- More details are shown in the `A Tour of Cybersec notebook features` notebook\n",
+     "- Threat Intel Lookups in MSTICPy (see the MSTICPy TIProviders documentation)\n",
+     "- To learn more about adding TI sources, see the TI Provider setup section in the `A Getting Started Guide For Microsoft Sentinel ML Notebooks` notebook\n",
+     "
\n", + "\n", + "\n" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632990440057 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "ti_lookup = TILookup()\n", + "# Perform lookup on single IOC\n", + "result = ti_lookup.lookup_ioc(observable=\"52.183.120.194\", providers=[\"XForce\"])\n", + "ti_lookup.result_to_df(result)" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "from msticpy.sectools.geoip import GeoLiteLookup\n", - "\n", - "iplocation = GeoLiteLookup()\n", - "\n", - "df = iplocation.df_lookup_ip(df, column=\"DestinationIP\")\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Whois registration enrichment\n", - "In this step, we can perform whois lokup on all public destination ips and populate additional information such as ASN. You can use this output to further filter known ASNs from the results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632989270369 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "gather": { + "logged": 1632991107856 + }, + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Flattening all the desnation IPs into comma separated list\n", + "ip_list = df['DestinationIP'].astype(str).values.flatten().tolist()\n", + "\n", + "# Perform bulk lookup on all IPs with specified providers\n", + "ti_resp = ti_lookup.lookup_iocs(data=ip_list, providers=[\"AzSTI\", \"XForce\"])\n", + "select_ti = browse_results(ti_resp, severities=['high','warning'])\n", + "select_ti" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Visualization\n", + "MSTICpy also includes a feature to allow you to map locations, this can be particularily useful when looking at the distribution of remote network connections or other events. Below we plot the locations of destination IPs observed in our results." + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "num_ips = len(df[\"DestinationIP\"].unique())\n", - "print(f\"Performing WhoIs lookups for {num_ips} IPs \", end=\"\")\n", - "df[\"DestASN\"] = df.apply(lambda x: get_whois_info(x.DestinationIP, True), axis=1)\n", - "df[\"DestASNFull\"] = df.apply(lambda x: x.DestASN[1], axis=1)\n", - "df[\"DestASN\"] = df.apply(lambda x: x.DestASN[0], axis=1)\n", - "\n", - "#Display results\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### ThreatIntel Enrichment\n", - "\n", - "In this step, we can perform threatintel lookup using msticpy and open source TI providers such as IBM Xforce, VirusTotal, Greynoise etc. \n", - "Below example shows performing lookup on single IP as well as bulk lookup on all ips using IBM Xforce TI Provider. \n", - "
You will need to register with IBM Xforce and enter API keys into `mstipyconfig.yaml`\n", - "\n", - "
\n", - " Learn more...\n", - "

\n", - "

\n", - "
    \n", - "
  • More details are shown in the A Tour of Cybersec notebook features notebook
  • \n", - "
  • Threat Intel Lookups in MSTICPy
  • \n", - "
  • To learn more about adding TI sources, see the TI Provider setup in the A Getting Started Guide For Microsoft Sentinel ML Notebooks notebook\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632990440057 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Convert our IP addresses in string format into an ip address entity\n", + "ip_entity = entityschema.IpAddress()\n", + "ip_list = [convert_to_ip_entities(i)[0] for i in df['DestinationIP'].head(10)]\n", + " \n", + "# Get center location of all IP locaitons to center the map on\n", + "location = get_map_center(ip_list)\n", + "logon_map = FoliumMap(location=location, zoom_start=4)\n", + "\n", + "# Add location markers to our map and dsiplay it\n", + "if len(ip_list) > 0:\n", + " logon_map.add_ip_cluster(ip_entities=ip_list)\n", + "display(logon_map.folium_map)" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Data Ingestion to Sentinel as Custom Table\n", + "Finally, you can also upload the enriched events generated from this notebook as a custom table to Microsoft Sentinel. This will allow to correlate with more data sources available in sentinel and continue investigate or create incidents when malicious activity is confirmed." + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "ti_lookup = TILookup()\n", - "# Perform lookup on single IOC\n", - "result = ti_lookup.lookup_ioc(observable=\"52.183.120.194\", providers=[\"XForce\"])\n", - "ti_lookup.result_to_df(result)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1632991107856 + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "la_ws_id = widgets.Text(description='Workspace ID:')\n", + "la_ws_key = widgets.Password(description='Workspace Key:')\n", + "display(la_ws_id)\n", + "display(la_ws_key)" + ] }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "# Instantiate data uploader for sentinel\n", + "la_up = LAUploader(workspace=la_ws_id.value, workspace_secret=la_ws_key.value, debug=True)\n", + "# Upload our DataFrame\n", + "la_up.upload_df(data=df, table_name='network_beaconing_anomalies')" + ] }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# Flattening all the desnation IPs into comma separated list\n", - "ip_list = df['DestinationIP'].astype(str).values.flatten().tolist()\n", - "\n", - "# Perform bulk lookup on all IPs with specified providers\n", - "ti_resp = ti_lookup.lookup_iocs(data=ip_list, providers=[\"AzSTI\", \"XForce\"])\n", - "select_ti = browse_results(ti_resp, severities=['high','warning'])\n", - "select_ti" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } + { + "cell_type": "markdown", + "metadata": { + "nteract": { + "transient": { 
+ "deleting": false + } + } + }, + "source": [ + "## Conclusion\n", + "\n", + "We originally started our hunting on very large datasets of firewall logs. Due to the sheer scale of data, we leveraged spark to load the data. \n", + "
We then performed baselining on historical data and used it to further filter the current-day dataset. Next, we applied various data transformations using Spark features such as partitioning, windowing, and ranking to find outbound network beaconing-like behavior.\n",
+     "
In order to analyze this data further, we enrich IP entities from result dataset with additional information such as Geolocation, whois registration and threat intel lookups. \n", + "\n", + "Analysts can perform further investigation on selected IP addresses from enrichment results by correlating various data sources available. \n", + "You can then create incidents in Microsoft Sentinel and track investigation in it.\n" + ] } - }, - "source": [ - "## Visualization\n", - "MSTICpy also includes a feature to allow you to map locations, this can be particularily useful when looking at the distribution of remote network connections or other events. Below we plot the locations of destination IPs observed in our results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false + ], + "metadata": { + "hide_input": false, + "kernel_info": { + "name": "synapse_pyspark" + }, + "kernelspec": { + "display_name": "Python 3.8 - AzureML", + "language": "python", + "name": "python38-azureml" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.10" + }, + "microsoft": { + "host": { + "AzureML": { + "notebookHasBeenCompleted": true + } + }, + "ms_spell_check": { + "ms_spell_check_language": "en" + } }, "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "from msticpy.nbtools import entityschema\n", - "from msticpy.sectools.ip_utils import convert_to_ip_entities\n", - "from msticpy.nbtools.foliummap import FoliumMap, get_map_center\n", - "\n", - "# Convert our IP addresses in string format into an ip address entity\n", - "ip_entity = entityschema.IpAddress()\n", - "ip_list = [convert_to_ip_entities(i)[0] for i in df['DestinationIP'].head(10)]\n", - " \n", - "# Get center location of all IP locaitons to center the map on\n", - "location = get_map_center(ip_list)\n", - "logon_map = FoliumMap(location=location, zoom_start=4)\n", - "\n", - "# Add location markers to our map and dsiplay it\n", - "if len(ip_list) > 0:\n", - " logon_map.add_ip_cluster(ip_entities=ip_list)\n", - "display(logon_map.folium_map)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Conclusion\n", - "\n", - "We originally started our hunting on very large datasets of firewall logs. Due to the sheer scale of data, we leveraged spark to load the data. \n", - "
We then performed baselining on historical data and use it to further filter current day dataset. In the next step we performed various data transformation by using spark features such as paritioning, windowing, ranking datatset to find outbound network beaconing like behavior.\n", - "
In order to analyze this data further, we enrich IP entities from result dataset with additional information such as Geolocation, whois registration and threat intel lookups. \n", - "\n", - "Analysts can perform further investigation on selected IP addresses from enrichment results by correlating various data sources available. \n", - "You can then create incidents in Microsoft Sentinel and track investigation in it.\n" - ] - } - ], - "metadata": { - "kernel_info": { - "name": "python38-azureml" - }, - "kernelspec": { - "display_name": "Python 3.8 - AzureML", - "language": "python", - "name": "python38-azureml" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.1" - }, - "microsoft": { - "host": { - "AzureML": { - "notebookHasBeenCompleted": true + "version": "nteract-front-end@1.0.0" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + "types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false } - } }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} + "nbformat": 4, + "nbformat_minor": 1 +} \ No newline at end of file