Microsoft challenges the data science community to develop techniques to predict whether a machine
will soon be hit by various families of malware, based on different properties of that
machine.
The challenge is available here.
- Read CSV
- Google Colab link to notebook
- Load a DataFrame in TensorFlow
- Descriptions of the data by people who have already worked on it
Link to this repo: https://github.com/massiltag/Microsoft-Malware-Prediction
Columns with a lot of NaN:
- DefaultBrowsersIdentifier : 95% NaN
- Census_IsFlightingInternal : 83% NaN
- Census_ThresholdOptIn : 63% NaN
- Census_IsWIMBootEnabled : 63% NaN
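The NaN percentages above can be recomputed with pandas. The following is a minimal sketch on a made-up toy frame, not the repo's actual code; `nan_share` is a hypothetical helper name and the 0.5 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd


def nan_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for train.csv; the values are made up.
toy = pd.DataFrame(
    {
        "DefaultBrowsersIdentifier": [np.nan, np.nan, np.nan, 239.0],
        "Census_IsWIMBootEnabled": [0.0, np.nan, np.nan, 0.0],
        "HasTpm": [1, 1, 0, 1],
    }
)

shares = nan_share(toy)
# Columns whose NaN share is strictly above the (assumed) 0.5 threshold.
mostly_missing = shares[shares > 0.5].index.tolist()
```

On the real data, the same function would be applied to the DataFrame loaded from train.csv.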
tiny_train contains the first 100 000 rows of the training set, so that a solution can be prototyped more quickly.
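One possible way to build such a sample is to let pandas stop reading after a fixed number of rows. This is a sketch under assumptions: `write_head` is a hypothetical helper, and the demo uses a made-up in-memory CSV so that it runs without the real file.

```python
import io

import pandas as pd


def write_head(source, target, n_rows: int = 100_000) -> int:
    """Copy the first n_rows data rows of a CSV into target; return rows written."""
    head = pd.read_csv(source, nrows=n_rows)  # reading stops after n_rows
    head.to_csv(target, index=False)
    return len(head)


# Demo on an in-memory CSV (made-up values) instead of the real train.csv.
csv_text = "MachineIdentifier,HasTpm,HasDetections\n" + "".join(
    f"id{i},1,{i % 2}\n" for i in range(10)
)
out = io.StringIO()
written = write_head(io.StringIO(csv_text), out, n_rows=3)
```

With the real data, `source` and `target` would be file paths, for example `write_head("train.csv", "tiny_train.csv")`; both file names are assumptions here.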
- use the machine ID (MachineIdentifier) as the index (unique values)
- drop columns with a single value (or nearly so)
- drop columns that are not in test.csv
- handle NaNs properly, i.e. do not fill categorical columns with medians (it makes no sense)!
- normalize the float columns
- drop columns with too many NaNs
- then drop the rows that still contain NaNs (to test whether it works better)
- also try a decision tree
Problems to handle in supervised learning:
- bias/variance tradeoff (more bias -> less variance)
  - e.g. if the model is systematically wrong when predicting for an input x, it is biased
  - e.g. if an input x gets different predictions when the model is trained on different datasets, the variance is high, i.e. the model is too flexible
- complexity of the function and size of the dataset
- dimensionality of the input
- noise in the input
- presence of interactions
- non-linearity
- compute stats, clean the data, cross-validation, dropout, split up the tasks
  - algorithm selection: keep the algorithm separate from its execution
  - try implementing a small scikit-learn algorithm and test the results on a small set (100 000 rows)
  - find the classifiers that would fit the need
  - find the features with the highest variance
  - run as many tests as possible
  - look at the hash column
- afterwards: unit tests, plots, check whether the PC information can be mapped to try to predict the infection risk
- if the input vector has a very high dimension, learning can be difficult even if the target function depends on only a small number of features
- fitting the data too closely leads to overfitting
- when a certain type of noise is present, it is better to increase the bias and reduce the variance
- advantage of naive Bayes: no need for a huge dataset to find the features (unless the algorithm has a strong bias)
  - P(Ck | x) = (P(Ck) * P(x | Ck)) / P(x) => posterior = ...
mini_train is a smaller file used to work faster, because the original file is too big. Number of rows in mini_train: 2 230 370
- decision tree: works on discrete values (which must also be neither NaN nor Inf). Node = test, leaf = result/prediction
  - widely used in data mining
  - simple to understand and interpret
  - handles large volumes of data well
  - little data preparation needed (tests can be run on most kinds of data)
  - white-box model (= everything can be explained)
  - can be validated with simple statistical tests
  - mimics human decision-making
  - robust to collinearity (i.e. when vectors point in the same direction)
  - etc.
  - limitations:
    - not as accurate as other approaches
    - can be non-robust (a small change in the data -> a big impact)
    - learning an optimal decision tree is NP-complete
    - risk of overfitting (solution: pruning)
    - when categorical variables are present, it favors those with a large number of categories
- regression tree: like a decision tree, but works for continuous/Inf values
- KNN:
  - takes the K nearest neighbors
  - sensitive to over-representation of one label, which causes bias
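The classifiers discussed in these notes can be compared on a small sample with cross-validation. The sketch below assumes scikit-learn is installed and uses synthetic data in place of a cleaned, numeric sample of the train set; the hyperparameters (`max_depth=4`, `n_neighbors=5`) are arbitrary illustrative choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for the real sample).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

candidates = {
    "naive_bayes": GaussianNB(),
    # max_depth caps tree growth, a simple guard against overfitting.
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
```

The resulting `scores` dictionary gives one number per model, which makes it easy to see which classifiers are worth testing further on the larger files.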
Train: 8 921 483 rows, 83 columns, which means a total of 740 483 089 values.
All machine hashes are filled in and there are no duplicates: 0 NaN in the MachineIdentifier column (8 921 483 values).
The MachineIdentifier column can therefore be used as the index.
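The two checks on MachineIdentifier (no missing values, no duplicates) can be expressed directly in pandas before using the column as the index. This is a minimal sketch on a made-up toy frame.

```python
import pandas as pd

# Toy frame with made-up identifiers, standing in for the train set.
toy = pd.DataFrame(
    {
        "MachineIdentifier": ["a1", "b2", "c3"],
        "HasTpm": [1, 0, 1],
    }
)

ids = toy["MachineIdentifier"]
# Same two checks as in the notes: no missing IDs and no duplicates.
assert ids.notna().all() and ids.is_unique

# verify_integrity=True makes pandas raise if a duplicate slips through.
indexed = toy.set_index("MachineIdentifier", verify_integrity=True)
```

When loading from a file, passing `index_col="MachineIdentifier"` to `pd.read_csv` sets the index at read time, but it does not check for duplicates.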
Test: 82 columns (does not contain the column 'HasDetections').
Columns with repeating values that could be useful:
- ProductName: {'fep', 'mse', 'scep', 'win8defender', 'mseprerelease', 'windowsintune'}
- IsBeta: {0, 1}
- IsSxsPassiveMode: {0, 1}
- HasTpm : {0, 1}
- Platform : {'windows8', 'windows2016', 'windows7', 'windows10'}
- Processor : {'x64', 'x86', 'arm64'}
- OsSuite : {256, 768, 400, 272, 16, 784, 305, 274, 49, 144, 402, 528, 307, 18}
- OsPlatformSubRelease : {'windows8.1', 'th1', 'rs2', 'prers5', 'windows7', 'rs4', 'th2', 'rs1', 'rs3'}
- SkuEdition : {'Pro', 'Server', 'Invalid', 'Enterprise', 'Home', 'Enterprise LTSB', 'Education', 'Cloud'}
- AutoSampleOptIn : {0, 1}
- Census_MDC2FormFactor : {'Detachable', 'Desktop', 'LargeServer', 'PCOther', 'Convertible', 'SmallServer', 'Notebook', 'AllInOne', 'MediumServer', 'ServerOther', 'LargeTablet', 'IoTOther', 'SmallTablet'}
- Census_DeviceFamily : {'Windows', 'Windows.Server', 'Windows.Desktop'}
- Census_HasOpticalDiskDrive : {0, 1}
- Census_PowerPlatformRoleName : {nan, 'PerformanceServer', 'UNKNOWN', 'SOHOServer', 'Unspecified', 'Desktop', 'AppliancePC', 'EnterpriseServer'}
- Census_OSArchitecture : {'x86', 'amd64', 'arm64'}
- Census_OSBranch : {'rs3_release_svc_escrow_im', 'win8_gdr', 'rs_xbox', 'th2_release_sec', 'rs5_release', 'rs1_release_srvmedia', 'rs_onecore_base_cobalt', 'rs1_release_sec', 'rs3_release_svc_escrow', 'rs1_release', 'rs_prerelease_flt', 'th1', 'th1_st1', 'rs3_release', 'th2_release', 'win7sp1_ldr_escrow', 'rs_prerelease', 'rs5_release_sigma_dev', 'rs5_release_sign', 'rs4_release', 'win7sp1_ldr', 'win8_ldr', 'rs_shell', 'winblue_ltsb', 'winblue_ltsb_escrow', 'rs2_release', 'rs5_release_sigma', 'rs1_release_svc', 'rs3_release_svc', 'rs_onecore_stack_per1', 'Khmer OS', 'rs5_release_edge'}
- Census_OSEdition : {'CloudN', 'ProfessionalN', 'ServerRdsh', 'ProfessionalEducationN', 'EnterpriseSN', 'Home', 'ProfessionalCountrySpecific', 'CoreSingleLanguage', 'Enterprise', 'ServerStandard', 'EnterpriseS', 'Pro', 'Core', 'ProfessionalWorkstationN', 'HomePremium', 'ServerDatacenterACor', 'Enterprise 2015 LTSB', 'Cloud', 'ServerSolution', 'Education', 'CoreCountrySpecific', 'ServerStandardEval', 'ProfessionalWorkstation', 'ServerDatacenter', 'ProfessionalSingleLanguage', 'EnterpriseN', 'professional', 'EducationN', 'CoreN', 'ServerDatacenterEval', 'Ultimate', 'Professional', 'ProfessionalEducation'}
- Census_OSSkuName : {'CORE_COUNTRYSPECIFIC', 'ULTIMATE', 'EDUCATION_N', 'PROFESSIONAL', 'CLOUDN', 'UNDEFINED', 'CORE', 'CORE_SINGLELANGUAGE', 'SERVERRDSH', 'STANDARD_EVALUATION_SERVER', 'PRO_WORKSTATION_N', 'STANDARD_SERVER', 'DATACENTER_EVALUATION_SERVER', 'ENTERPRISE_S', 'PRO_FOR_EDUCATION', 'PRO_WORKSTATION', 'STARTER', 'ENTERPRISE', 'CORE_N', 'ENTERPRISE_N', 'PRO_CHINA', 'EDUCATION', 'UNLICENSED', 'PROFESSIONAL_N', 'DATACENTER_SERVER', 'PRO_SINGLE_LANGUAGE', 'SB_SOLUTION_SERVER', 'ENTERPRISE_S_N', 'CLOUD', 'ENTERPRISEG'}
- Census_OSInstallTypeName : {'Other', 'Update', 'IBSClean', 'UUPUpgrade', 'Clean', 'Refresh', 'Upgrade', 'Reset', 'CleanPCRefresh'}
- Census_OSWUAutoUpdateOptionsName : {'AutoInstallAndRebootAtMaintenanceTime', 'Notify', 'UNKNOWN', 'FullAuto', 'Off', 'DownloadNotify'}
- Census_IsPortableOperatingSystem : {0, 1}
- Census_GenuineStateName : {'OFFLINE', 'IS_GENUINE', 'TAMPERED', 'INVALID_LICENSE', 'UNKNOWN'}
- Census_ActivationChannel : {'Volume:MAK', 'Retail', 'Volume:GVLK', 'OEM:NONSLP', 'OEM:DM', 'Retail:TB:Eval'}
Note: DefaultBrowsersIdentifier, Census_IsFlightingInternal, Census_ThresholdOptIn and Census_IsWIMBootEnabled are not included because they contain too many NaN values.
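The cleaning ideas in these notes (drop columns with too many NaNs, never fill a categorical column with a median, normalize float columns) could be combined roughly as follows. This is a sketch under assumptions: `clean` is a hypothetical helper, the 0.5 threshold and the "missing" placeholder are arbitrary choices, and the caller states explicitly which columns are numeric and which are categorical, since many float64 columns in this dataset are really identifiers.

```python
import numpy as np
import pandas as pd


def clean(df, numeric_cols, categorical_cols, max_nan_share=0.5):
    """Drop mostly-missing columns, then impute and scale the remaining ones."""
    out = df.loc[:, df.isna().mean() <= max_nan_share].copy()
    for name in numeric_cols:
        if name not in out.columns:
            continue
        col = out[name].fillna(out[name].median())  # median only for true numerics
        spread = col.std()
        # z-score; a constant column is only centered to avoid dividing by zero
        out[name] = (col - col.mean()) / spread if spread > 0 else col - col.mean()
    for name in categorical_cols:
        if name not in out.columns:
            continue
        # categorical gaps get an explicit placeholder category instead of a median
        out[name] = out[name].astype(object).fillna("missing")
    return out


# Toy frame with made-up values.
toy = pd.DataFrame(
    {
        "Census_TotalPhysicalRAM": [4096.0, 8192.0, np.nan, 4096.0],
        "SmartScreen": ["RequireAdmin", None, "Off", "RequireAdmin"],
        "Census_IsFlightingInternal": [np.nan, np.nan, np.nan, 0.0],
    }
)
cleaned = clean(
    toy,
    numeric_cols=["Census_TotalPhysicalRAM"],
    categorical_cols=["SmartScreen"],
)
```

Keeping the column lists explicit is a deliberate choice: deciding by dtype alone would treat identifier columns such as CityIdentifier as numeric.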
Number of distinct values per column:
- ProductName : 6
- EngineVersion : 70
- AppVersion : 110
- AvSigVersion : 8531
- IsBeta : 2
- RtpStateBitfield : 7
- IsSxsPassiveMode : 2
- AVProductStatesIdentifier : 28970
- AVProductsInstalled : 8
- AVProductsEnabled : 6
- HasTpm : 2
- CountryIdentifier : 222
- CityIdentifier : 107366
- OrganizationIdentifier : 49
- GeoNameIdentifier : 292
- LocaleEnglishNameIdentifier : 276
- Platform : 4
- Processor : 3
- OsVer : 58
- OsBuild : 76
- OsSuite : 14
- OsPlatformSubRelease : 9
- OsBuildLab : 664
- SkuEdition : 8
- IsProtected : 2
- AutoSampleOptIn : 2
- PuaMode : 3
- SMode : 2
- IeVerIdentifier : 303
- SmartScreen : 22
- Firewall : 2
- UacLuaenable : 11
- Census_MDC2FormFactor : 13
- Census_DeviceFamily : 3
- Census_OEMNameIdentifier : 3832
- Census_OEMModelIdentifier : 175365
- Census_ProcessorCoreCount : 45
- Census_ProcessorManufacturerIdentifier : 7
- Census_ProcessorModelIdentifier : 3428
- Census_ProcessorClass : 4
- Census_PrimaryDiskTotalCapacity : 5735
- Census_PrimaryDiskTypeName : 5
- Census_SystemVolumeTotalCapacity : 536848
- Census_HasOpticalDiskDrive : 2
- Census_TotalPhysicalRAM : 3446
- Census_ChassisTypeName : 53
- Census_InternalPrimaryDiagonalDisplaySizeInInches : 785
- Census_InternalPrimaryDisplayResolutionHorizontal : 2180
- Census_InternalPrimaryDisplayResolutionVertical : 1560
- Census_PowerPlatformRoleName : 11
- Census_InternalBatteryType : 79
- Census_InternalBatteryNumberOfCharges : 41088
- Census_OSVersion : 469
- Census_OSArchitecture : 3
- Census_OSBranch : 32
- Census_OSBuildNumber : 165
- Census_OSBuildRevision : 285
- Census_OSEdition : 33
- Census_OSSkuName : 30
- Census_OSInstallTypeName : 9
- Census_OSInstallLanguageIdentifier : 39
- Census_OSUILocaleIdentifier : 147
- Census_OSWUAutoUpdateOptionsName : 6
- Census_IsPortableOperatingSystem : 2
- Census_GenuineStateName : 5
- Census_ActivationChannel : 6
- Census_IsFlightsDisabled : 2
- Census_FlightRing : 10
- Census_FirmwareManufacturerIdentifier : 712
- Census_FirmwareVersionIdentifier : 50494
- Census_IsSecureBootEnabled : 2
- Census_IsVirtualDevice : 2
- Census_IsTouchEnabled : 2
- Census_IsPenCapable : 2
- Census_IsAlwaysOnAlwaysConnectedCapable : 2
- Wdft_IsGamer : 2
- Wdft_RegionIdentifier : 15
- HasDetections : 2
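Counts like these can be obtained with `DataFrame.nunique`, and the same pass can flag columns that are constant or nearly constant. The sketch below uses a made-up toy frame; which share of the most frequent value counts as "nearly constant" is left to the reader.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame(
    {
        "Platform": ["windows10", "windows10", "windows8", "windows7"],
        "IsBeta": [0, 0, 0, 0],
        "CityIdentifier": [101.0, 202.0, 303.0, None],
    }
)

# Distinct values per column; NaN is not counted by default.
counts = toy.nunique()
constant = counts[counts <= 1].index.tolist()

# Share of the most frequent value per column (NaN counted as a value here).
top_share = {
    name: toy[name].value_counts(normalize=True, dropna=False).iloc[0]
    for name in toy.columns
}
```

A column whose `top_share` is close to 1 carries very little information and is a candidate for removal.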
Data type of each column:
- MachineIdentifier : object
- ProductName : object
- EngineVersion : object
- AppVersion : object
- AvSigVersion : object
- IsBeta : int64
- RtpStateBitfield : float64
- IsSxsPassiveMode : int64
- DefaultBrowsersIdentifier : float64
- AVProductStatesIdentifier : float64
- AVProductsInstalled : float64
- AVProductsEnabled : float64
- HasTpm : int64
- CountryIdentifier : int64
- CityIdentifier : float64
- OrganizationIdentifier : float64
- GeoNameIdentifier : float64
- LocaleEnglishNameIdentifier : int64
- Platform : object
- Processor : object
- OsVer : object
- OsBuild : int64
- OsSuite : int64
- OsPlatformSubRelease : object
- OsBuildLab : object
- SkuEdition : object
- IsProtected : float64
- AutoSampleOptIn : int64
- PuaMode : object
- SMode : float64
- IeVerIdentifier : float64
- SmartScreen : object
- Firewall : float64
- UacLuaenable : float64
- Census_MDC2FormFactor : object
- Census_DeviceFamily : object
- Census_OEMNameIdentifier : float64
- Census_OEMModelIdentifier : float64
- Census_ProcessorCoreCount : float64
- Census_ProcessorManufacturerIdentifier : float64
- Census_ProcessorModelIdentifier : float64
- Census_ProcessorClass : object
- Census_PrimaryDiskTotalCapacity : float64
- Census_PrimaryDiskTypeName : object
- Census_SystemVolumeTotalCapacity : float64
- Census_HasOpticalDiskDrive : int64
- Census_TotalPhysicalRAM : float64
- Census_ChassisTypeName : object
- Census_InternalPrimaryDiagonalDisplaySizeInInches : float64
- Census_InternalPrimaryDisplayResolutionHorizontal : float64
- Census_InternalPrimaryDisplayResolutionVertical : float64
- Census_PowerPlatformRoleName : object
- Census_InternalBatteryType : object
- Census_InternalBatteryNumberOfCharges : float64
- Census_OSVersion : object
- Census_OSArchitecture : object
- Census_OSBranch : object
- Census_OSBuildNumber : int64
- Census_OSBuildRevision : int64
- Census_OSEdition : object
- Census_OSSkuName : object
- Census_OSInstallTypeName : object
- Census_OSInstallLanguageIdentifier : float64
- Census_OSUILocaleIdentifier : int64
- Census_OSWUAutoUpdateOptionsName : object
- Census_IsPortableOperatingSystem : int64
- Census_GenuineStateName : object
- Census_ActivationChannel : object
- Census_IsFlightingInternal : float64
- Census_IsFlightsDisabled : float64
- Census_FlightRing : object
- Census_ThresholdOptIn : float64
- Census_FirmwareManufacturerIdentifier : float64
- Census_FirmwareVersionIdentifier : float64
- Census_IsSecureBootEnabled : int64
- Census_IsWIMBootEnabled : float64
- Census_IsVirtualDevice : float64
- Census_IsTouchEnabled : int64
- Census_IsPenCapable : int64
- Census_IsAlwaysOnAlwaysConnectedCapable : float64
- Wdft_IsGamer : float64
- Wdft_RegionIdentifier : float64
- HasDetections : int64
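These are the types pandas infers by default. Since the original file is too big to work with comfortably, one option is to pass narrower types to `read_csv`, which generally lowers memory use on large files. The mapping below is a hypothetical example for a few columns, not the repo's actual settings, and the CSV is a made-up in-memory sample.

```python
import io

import pandas as pd

# Hypothetical dtype choices; integer types only suit columns without NaN.
dtypes = {
    "ProductName": "category",
    "IsBeta": "int8",
    "Census_TotalPhysicalRAM": "float32",
    "HasDetections": "int8",
}

csv_text = (
    "MachineIdentifier,ProductName,IsBeta,Census_TotalPhysicalRAM,HasDetections\n"
    "a1,win8defender,0,4096,0\n"
    "b2,win8defender,0,,1\n"
    "c3,mse,0,8192,1\n"
)

default = pd.read_csv(io.StringIO(csv_text))  # inferred types
compact = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)  # explicit types
```

Columns listed as float64 only because they contain NaN cannot be read as plain NumPy integers, so they need either a float type or a prior imputation step.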
The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.
Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.
The sampling methodology used to create this dataset was designed to meet certain business constraints, both in regards to user privacy as well as the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is made complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above may mean you may see imperfect agreement between your cross validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.
# Columns
Unavailable or self-documenting column names are marked with an "NA".
- MachineIdentifier - Individual machine ID
- ProductName - Defender state information e.g. win8defender
- EngineVersion - Defender state information e.g. 1.1.12603.0
- AppVersion - Defender state information e.g. 4.9.10586.0
- AvSigVersion - Defender state information e.g. 1.217.1014.0
- IsBeta - Defender state information e.g. false
- RtpStateBitfield - NA
- IsSxsPassiveMode - NA
- DefaultBrowsersIdentifier - ID for the machine's default browser
- AVProductStatesIdentifier - ID for the specific configuration of a user's antivirus software
- AVProductsInstalled - NA
- AVProductsEnabled - NA
- HasTpm - True if machine has tpm
- CountryIdentifier - ID for the country the machine is located in
- CityIdentifier - ID for the city the machine is located in
- OrganizationIdentifier - ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries
- GeoNameIdentifier - ID for the geographic region a machine is located in
- LocaleEnglishNameIdentifier - English name of Locale ID of the current user
- Platform - Calculates platform name (of OS related properties and processor property)
- Processor - This is the process architecture of the installed operating system
- OsVer - Version of the current operating system
- OsBuild - Build of the current operating system
- OsSuite - Product suite mask for the current operating system.
- OsPlatformSubRelease - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)
- OsBuildLab - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022
- SkuEdition - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Type are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must hand each class.
- IsProtected - This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected.
- AutoSampleOptIn - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+
- PuaMode - Pua Enabled mode from the service
- SMode - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
- IeVerIdentifier - NA
- SmartScreen - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.
- Firewall - This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service.
- UacLuaenable - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.
- Census_MDC2FormFactor - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)
- Census_DeviceFamily - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
- Census_OEMNameIdentifier - NA
- Census_OEMModelIdentifier - NA
- Census_ProcessorCoreCount - Number of logical cores in the processor
- Census_ProcessorManufacturerIdentifier - NA
- Census_ProcessorModelIdentifier - NA
- Census_ProcessorClass - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated
- Census_PrimaryDiskTotalCapacity - Amount of disk space on primary disk of the machine in MB
- Census_PrimaryDiskTypeName - Friendly name of Primary Disk Type - HDD or SSD
- Census_SystemVolumeTotalCapacity - The size of the partition that the System volume is installed on in MB
- Census_HasOpticalDiskDrive - True indicates that the machine has an optical disk drive (CD/DVD)
- Census_TotalPhysicalRAM - Retrieves the physical RAM in MB
- Census_ChassisTypeName - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
- Census_InternalPrimaryDiagonalDisplaySizeInInches - Retrieves the physical diagonal length in inches of the primary display
- Census_InternalPrimaryDisplayResolutionHorizontal - Retrieves the number of pixels in the horizontal direction of the internal display.
- Census_InternalPrimaryDisplayResolutionVertical - Retrieves the number of pixels in the vertical direction of the internal display
- Census_PowerPlatformRoleName - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
- Census_InternalBatteryType - NA
- Census_InternalBatteryNumberOfCharges - NA
- Census_OSVersion - Numeric OS version Example - 10.0.10130.0
- Census_OSArchitecture - Architecture on which the OS is based. Derived from OSVersionFull. Example - amd64
- Census_OSBranch - Branch of the OS extracted from the OsVersionFull. Example - OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005
- Census_OSBuildNumber - OS Build number extracted from the OsVersionFull. Example - OsBuildNumber = 10512 or 10240
- Census_OSBuildRevision - OS Build revision extracted from the OsVersionFull. Example - OsBuildRevision = 1000 or 16458
- Census_OSEdition - Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise
- Census_OSSkuName - OS edition friendly name (currently Windows only)
- Census_OSInstallTypeName - Friendly description of what install was used on the machine i.e. clean
- Census_OSInstallLanguageIdentifier - NA
- Census_OSUILocaleIdentifier - NA
- Census_OSWUAutoUpdateOptionsName - Friendly name of the WindowsUpdate auto-update settings on the machine.
- Census_IsPortableOperatingSystem - Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.
- Census_GenuineStateName - Friendly name of OSGenuineStateID. 0 = Genuine
- Census_ActivationChannel - Retail license key or Volume license key for a machine.
- Census_IsFlightingInternal - NA
- Census_IsFlightsDisabled - Indicates if the machine is participating in flighting.
- Census_FlightRing - The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.
- Census_ThresholdOptIn - NA
- Census_FirmwareManufacturerIdentifier - NA
- Census_FirmwareVersionIdentifier - NA
- Census_IsSecureBootEnabled - Indicates if Secure Boot mode is enabled.
- Census_IsWIMBootEnabled - NA
- Census_IsVirtualDevice - Identifies a Virtual Machine (machine learning model)
- Census_IsTouchEnabled - Is this a touch device?
- Census_IsPenCapable - Is the device capable of pen input?
- Census_IsAlwaysOnAlwaysConnectedCapable - Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.
- Wdft_IsGamer - Indicates whether the device is a gamer device or not based on its hardware combination.
- Wdft_RegionIdentifier - NA