What is the best index strategy or query SELECT when performing a search/lookup BETWEEN IP address (IPv4 and...
Question: Is there a better indexing strategy or query SELECT that I can use for looking up one large data set against another large data set? Or, should I look at placing the lookup dimension table in memory (all 125 GB of it)?
Server Configuration:
- The server is a virtual server running on top of VMWare, so additional hardware can be added in the background without having to reinstall the operating system
- Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64)
Aug 22 2017 17:04:49
Copyright (C) 2017 Microsoft Corporation
Standard Edition (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: ) (Hypervisor)
Note: I was previously on 2014 Enterprise - I have inquired why I was placed on Standard.
- There is only one instance that is running 2 databases: mine and the DBA's
- 2 File groups, with 1 file each: PRIMARY (system tables : not-default) and SECONDARY (non-system tables : default). The SECONDARY was meant to be scalable to hold more files once more CPUs were added. When the file group was initially created the server only had 2 CPUs
- 8 GB memory
- 500 GB disk storage (ISCSI SAN)
- 4 CPUs (Intel I assume)
IIS Exchange Server log table Schema:
CREATE TABLE [FWY].[ExchangeServerLogTest](
[RowKey] [int] IDENTITY(1,1) NOT NULL,
[SourceFileName] [varchar](50) NOT NULL,
[SourceServer] [varchar](9) NOT NULL,
[SourceService] [varchar](6) NOT NULL,
[EventOccuranceTs] [datetime] NOT NULL,
[ServiceType] [varchar](50) NOT NULL,
[UserNameType] [varchar](25) NOT NULL,
[DomainId] [varchar](50) NULL,
[DomainName] [varchar](255) NULL,
[UserNameToLookup] [varchar](255) NOT NULL,
[UserAgent] [varchar](255) NULL,
[OutsideProtocolId] [varchar](10) NOT NULL,
[OutsideIp] [varchar](39) NULL,
[OutsideIpHex] [varbinary](16) NULL,
[InsideProtocolId] [varchar](10) NOT NULL,
[InsideIp] [varchar](39) NULL,
[InsideIpHex] [varbinary](16) NULL,
[DeviceId] [varchar](32) NULL,
[DeviceType] [varchar](25) NULL,
[DeviceModel] [varchar](75) NULL,
[AsOfDt] [date] NULL,
[OutsideProtocolKey] [int] NULL,
[InsideProtocolKey] [int] NULL,
CONSTRAINT [PK_ExchangeServerLogTest] PRIMARY KEY CLUSTERED
(
[RowKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]
) ON [SECONDARY]
Non-Clustered Index:
CREATE NONCLUSTERED INDEX [NCIDX_ExchangeServerLogTest_InsideOutsideProtocolKeyIpHexInclRowKey] ON [FWY].[ExchangeServerLogTest]
(
[InsideProtocolKey] ASC,
[OutsideProtocolKey] ASC,
[InsideIpHex] ASC,
[OutsideIpHex] ASC
)
INCLUDE ( [RowKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
IP GeoLocation data vendor table schema
CREATE TABLE [DE].[IpGeoLocation](
[CreateTs] [datetime] NOT NULL,
[CreateBy] [varchar](50) NOT NULL,
[CreateSequenceKey] [int] NULL,
[UpdateTs] [datetime] NULL,
[UpdateBy] [varchar](50) NULL,
[UpdateSequenceKey] [int] NULL,
[ActiveInd] [int] NOT NULL,
[RowKey] [int] IDENTITY(1,1) NOT NULL,
[VendorKey] [int] NULL,
[VendorTypeKey] [int] NULL,
[DimensionTypeKey] [int] NULL,
[ProtocolKey] [int] NULL,
[ProtocolId] [varchar](10) NOT NULL,
[EffectiveStartDate] [date] NULL,
[EffectiveEndDate] [date] NULL,
[NetworkStartIp] [varchar](39) NOT NULL,
[NetworkStartIpHex] [varbinary](16) NULL,
[NetworkEndIp] [varchar](39) NOT NULL,
[NetworkEndIpHex] [varbinary](16) NULL,
[Country] [varchar](255) NOT NULL,
[Region] [varchar](255) NOT NULL,
[City] [varchar](255) NOT NULL,
[ConnectionSpeed] [varchar](255) NOT NULL,
[ConnectionType] [varchar](255) NOT NULL,
[MetroCode] [int] NOT NULL,
[Latitude] [numeric](6, 3) NULL,
[Longitude] [numeric](6, 3) NULL,
[PostalCode] [varchar](255) NOT NULL,
[PostalExtension] [varchar](255) NOT NULL,
[CountryCode] [int] NOT NULL,
[RegionCode] [int] NOT NULL,
[CityCode] [int] NOT NULL,
[ContinentCode] [int] NOT NULL,
[TwoLetterCountry] [varchar](2) NOT NULL,
[InternalCode] [int] NOT NULL,
[AreaCodes] [varchar](255) NOT NULL,
[CountryConfidenceCode] [int] NOT NULL,
[RegionConfidenceCode] [int] NOT NULL,
[CityConfidenceCode] [int] NOT NULL,
[PostalConfidenceCode] [int] NOT NULL,
[GmtOffset] [varchar](255) NOT NULL,
[InDistance] [varchar](255) NOT NULL,
[TimeZoneName] [varchar](255) NOT NULL,
CONSTRAINT [PK_IpGeoLocation] PRIMARY KEY CLUSTERED
(
[RowKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]
) ON [SECONDARY]
Non-Clustered Index:
CREATE NONCLUSTERED INDEX [NCIDX_IpGeoLocation_ProtocolKeyNetworkStartEndIpHexIncRowKey] ON [DE].[IpGeoLocation]
(
[ProtocolKey] ASC,
[NetworkStartIpHex] ASC,
[NetworkEndIpHex] ASC
)
INCLUDE ( [RowKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
IP addresses are converted to their binary (hexadecimal) value using .NET's System.Net.IPAddress class: IPAddress.Parse(ipAddress).GetAddressBytes(). I load the data files with SSIS, and a script component returns the ProtocolId and the IP address as a byte array, which goes into SSIS as DT_BYTES and is mapped to a SQL Server VARBINARY(16) column (the bytes are stored as-is; SQL Server displays them in hexadecimal).
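To make the stored format concrete, here is a minimal T-SQL sketch. The literal values are made up for illustration and are not taken from the log data. GetAddressBytes() returns 4 bytes for an IPv4 address and 16 bytes for an IPv6 address, so the VARBINARY(16) columns hold values of two different lengths:

```sql
-- Hypothetical sample values, chosen only to illustrate the byte layout.
DECLARE @Ipv4Hex varbinary(16) = 0x0A000001;                          -- bytes of 10.0.0.1
DECLARE @Ipv6Hex varbinary(16) = 0x00000000000000000000000000000001;  -- bytes of ::1

-- DATALENGTH reports the number of bytes actually stored in each value.
SELECT DATALENGTH(@Ipv4Hex) AS Ipv4Bytes,   -- 4
       DATALENGTH(@Ipv6Hex) AS Ipv6Bytes;   -- 16
```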
Lookup IP Address range
I have two data sets: IIS Exchange Server IP log records and IP GeoLocation data provided by a 3rd party vendor, where each GeoLocation row covers a range of IP addresses. I need to look up the IP address from the log file and get its GeoLocation. Both data sets accommodate IPv4 and IPv6, and the IP address is received in string format. When I load the data, I convert the IP address into a hexadecimal value [VARBINARY(16)] so that I can look up an IP address's GeoLocation.
The problem here is that I am loading a large number of records. Currently, the vendor provides close to 200 million IP address GeoLocations (i.e., the dimension lookup table). I knew from the inception that performance optimization would be required at all stages (i.e., hardware configuration, table partitioning, and indexing strategy). I have loaded one week's worth of sample log data, which is approximately 150 million records.
Note: When the log files are parsed, approximately 90% of the records are ignored; we load only the remaining 10%, so there is no further performance gain to be had here.
I have created the following indexes on the ExchangeServerLogTest table:
- A clustered index on an integer IDENTITY column called RowKey
- A non-clustered index on the ProtocolKey columns (i.e., IPv4 or IPv6 represented as integers) and the IpHex columns, with RowKey included
I have created the following indexes on the IpGeoLocation table:
- A clustered index on an integer IDENTITY column called RowKey
- A non-clustered index on ProtocolKey (i.e., IPv4 or IPv6 represented as integers), NetworkStartIpHex, and NetworkEndIpHex, with RowKey included
When searching for the IP Geolocation, I join the two datasets as follows:
SELECT COUNT(DISTINCT DE.RowKey)
FROM DE.IpGeoLocation DE
INNER JOIN FWY.ExchangeServerLogTest T
ON T.InsideProtocolKey = DE.ProtocolKey
AND T.InsideIpHex BETWEEN DE.NetworkStartIpHex AND DE.NetworkEndIpHex
Estimated Query Execution Plan: Estimated InsideIp Query Execution Plan
Actual Query Execution Plan: Waiting for query to complete
SELECT COUNT(DISTINCT DE.RowKey)
FROM DE.IpGeoLocation DE
INNER JOIN FWY.ExchangeServerLogTest T
ON T.OutsideProtocolKey = DE.ProtocolKey
AND T.OutsideIpHex BETWEEN DE.NetworkStartIpHex AND DE.NetworkEndIpHex
Estimated Execution Plan: Estimated OutsideIp Query Execution Plan
Actual Query Execution Plan: DOES NOT FINISH
Note 2: The ProtocolId must be included, otherwise there are two results for each IP lookup: one for IPv4 and one for IPv6.
This seems like a very efficient execution plan considering 95% of the cost is on an index seek and another 2% on an index scan - 97% is attributed to index work.
The log files contain both an internal and an external IP address on each row. For the sample data loaded:
- The internal IP list contains 3 DISTINCT IP addresses.
- The external IP list contains approximately 60,000 DISTINCT IP addresses.
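For reference, distinct counts like these could be obtained with a query along the following lines. This is only a sketch against the table definition shown earlier, not necessarily the query that produced the figures above:

```sql
-- Count the distinct inside and outside addresses in the sample log table.
-- COUNT(DISTINCT ...) ignores NULLs, so rows without an address are not counted.
SELECT COUNT(DISTINCT T.InsideIpHex)  AS DistinctInsideIps,
       COUNT(DISTINCT T.OutsideIpHex) AS DistinctOutsideIps
FROM FWY.ExchangeServerLogTest AS T;
```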
Results:
- A SELECT on the internal IP list takes about 9 minutes to complete.
- A SELECT on the external IP list was stopped after allowing it to run for 16.25 hours (overnight).
I have not partitioned either the log table or the IP GeoLocation table. This might provide a performance boost by streaming data through two separate LUNs, but I am still trying to get a hardware configuration specification from our IT Ops group (they just provisioned new servers, so I don't have that info yet).
sql-server performance query-performance index-tuning configuration
You've included a lot of info, but you might want to check this post on how to improve it with more details to help get it answered efficiently.
– LowlyDBA
3 hours ago
@LowlyDBA I have updated the question to include the server configuration. Are there any other additions or subtractions that you recommend?
– J Weezy
3 hours ago
Schema creation scripts & actual plans via pastetheplan.com
– LowlyDBA
3 hours ago
I wonder, how do you represent IPv6 addresses as integers? SQL Server's int and bigint have 4 and 8 bytes respectively; IPv6 needs 16 bytes.
– ypercubeᵀᴹ
3 hours ago
Another (minor) possible improvement: if the possible values for the ProtocolKey columns are only two (for IPv4 / IPv6) and you could convert all those columns from int to tinyint, you would save 3 bytes per row. It won't be a huge saving, but for big tables, it would help (a little). For the not so big, 200M rows x 3 bytes = 600 MB saved, for every index where the columns appear.
– ypercubeᵀᴹ
3 hours ago
|
show 6 more comments
Question: Is there a better indexing strategy or query SELECT that I can use for looking up one large data set against another large data set? Or, should I look at placing the lookup dimension table in memory (all 125 GB of it)?
Server Configuration:
- The server is a virtual server running on top of VMWare, so additional hardware can be added in the background without having to reinstall the operating system
- Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64)
Aug 22 2017 17:04:49
Copyright (C) 2017 Microsoft Corporation
Standard Edition (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: ) (Hypervisor)
Note: I was previously on 2014 Enterprise - I have inquired why I was placed on Standard.- There is only one instance that is running 2 databases: mine and the DBAs
- 2 File groups, with 1 file each: PRIMARY (system tables : not-default) and SECONDARY (non-system tables : default). The SECONDARY was meant to be scalable to hold more files once more CPUs were added. When the file group was initially created the server only had 2 CPUs
- 8 GB memory
- 500 GB disk storage (ISCSI SAN)
- 4 CPUs (Intel I assume)
IIS Exchange Server log table Schema:
CREATE TABLE [FWY].[ExchangeServerLogTest](
[RowKey] [int] IDENTITY(1,1) NOT NULL,
[SourceFileName] [varchar](50) NOT NULL,
[SourceServer] [varchar](9) NOT NULL,
[SourceService] [varchar](6) NOT NULL,
[EventOccuranceTs] [datetime] NOT NULL,
[ServiceType] [varchar](50) NOT NULL,
[UserNameType] [varchar](25) NOT NULL,
[DomainId] [varchar](50) NULL,
[DomainName] [varchar](255) NULL,
[UserNameToLookup] [varchar](255) NOT NULL,
[UserAgent] [varchar](255) NULL,
[OutsideProtocolId] [varchar](10) NOT NULL,
[OutsideIp] [varchar](39) NULL,
[OutsideIpHex] [varbinary](16) NULL,
[InsideProtocolId] [varchar](10) NOT NULL,
[InsideIp] [varchar](39) NULL,
[InsideIpHex] [varbinary](16) NULL,
[DeviceId] [varchar](32) NULL,
[DeviceType] [varchar](25) NULL,
[DeviceModel] [varchar](75) NULL,
[AsOfDt] [date] NULL,
[OutsideProtocolKey] [int] NULL,
[InsideProtocolKey] [int] NULL,
CONSTRAINT [PK_ExchangeServerLogTest] PRIMARY KEY CLUSTERED
(
[RowKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]
) ON [SECONDARY]
Non-Clustered Index:
CREATE NONCLUSTERED INDEX [NCIDX_ExchangeServerLogTest_InsideOutsideProtocolKeyIpHexInclRowKey] ON [FWY].[ExchangeServerLogTest]
(
[InsideProtocolKey] ASC,
[OutsideProtocolKey] ASC,
[InsideIpHex] ASC,
[OutsideIpHex] ASC
)
INCLUDE ( [RowKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
IP GeoLocation data vendor table schema
CREATE TABLE [DE].[IpGeoLocation](
[CreateTs] [datetime] NOT NULL,
[CreateBy] [varchar](50) NOT NULL,
[CreateSequenceKey] [int] NULL,
[UpdateTs] [datetime] NULL,
[UpdateBy] [varchar](50) NULL,
[UpdateSequenceKey] [int] NULL,
[ActiveInd] [int] NOT NULL,
[RowKey] [int] IDENTITY(1,1) NOT NULL,
[VendorKey] [int] NULL,
[VendorTypeKey] [int] NULL,
[DimensionTypeKey] [int] NULL,
[ProtocolKey] [int] NULL,
[ProtocolId] [varchar](10) NOT NULL,
[EffectiveStartDate] [date] NULL,
[EffectiveEndDate] [date] NULL,
[NetworkStartIp] [varchar](39) NOT NULL,
[NetworkStartIpHex] [varbinary](16) NULL,
[NetworkEndIp] [varchar](39) NOT NULL,
[NetworkEndIpHex] [varbinary](16) NULL,
[Country] [varchar](255) NOT NULL,
[Region] [varchar](255) NOT NULL,
[City] [varchar](255) NOT NULL,
[ConnectionSpeed] [varchar](255) NOT NULL,
[ConnectionType] [varchar](255) NOT NULL,
[MetroCode] [int] NOT NULL,
[Latitude] [numeric](6, 3) NULL,
[Longitude] [numeric](6, 3) NULL,
[PostalCode] [varchar](255) NOT NULL,
[PostalExtension] [varchar](255) NOT NULL,
[CountryCode] [int] NOT NULL,
[RegionCode] [int] NOT NULL,
[CityCode] [int] NOT NULL,
[ContinentCode] [int] NOT NULL,
[TwoLetterCountry] [varchar](2) NOT NULL,
[InternalCode] [int] NOT NULL,
[AreaCodes] [varchar](255) NOT NULL,
[CountryConfidenceCode] [int] NOT NULL,
[RegionConfidenceCode] [int] NOT NULL,
[CityConfidenceCode] [int] NOT NULL,
[PostalConfidenceCode] [int] NOT NULL,
[GmtOffset] [varchar](255) NOT NULL,
[InDistance] [varchar](255) NOT NULL,
[TimeZoneName] [varchar](255) NOT NULL,
CONSTRAINT [PK_IpGeoLocation] PRIMARY KEY CLUSTERED
(
[RowKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]
) ON [SECONDARY]
Non-Clustered Index:
CREATE NONCLUSTERED INDEX [NCIDX_IpGeoLocation_ProtocolKeyNetworkStartEndIpHexIncRowKey] ON [DE].[IpGeoLocation]
(
[ProtocolKey] ASC,
[NetworkStartIpHex] ASC,
[NetworkEndIpHex] ASC
)
INCLUDE ( [RowKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
IP addresses are converted to their hexadecimal value using .NET's System.Net class: Ipaddress.Parse(IpAddress).GetAddressBytes(). I load the data files with SSIS and I have a script component that returns the ProtocolId and the IP address as a Byte array, which goes into SSIS as DT_BYTE and is mapped to a SQL Server VARBINARY(16) field (the byte array is implicitly converted to a hexadecimal value).
Lookup IP Address range
I have two data sets: IIS Exchange Server IP log records and IP GeoLocation data provided by a 3rd party vendor; where the Geolocation covers a range of IP addresses. I need to lookup the IP address from the log file and get its GeoLocation. Both data sets accommodate for IPv4 and IPv6 and the IP address is received in string format. When I load the data, I convert the IP address into a hexadecimal value [VARBINARY(16)] so that I can lookup an IP addresses GeoLocation.
The problem here is that I am loading a large amount of records. Currently, the vendor provides close to 200 million IP address Geolocations (i.e., dimension lookup table). I knew from the inception that performance optimization will be required at all stages (i.e., hardware configuration, table partitioning, and indexing strategy). I have loaded one week's worth of sample log data and that is approximately 150 million records.
Note: The log files are parsed where approximately 90% of records are ignored - we are only loading 10% of the records, so there is no performance boost that can be made here
I have created the following indexes on the ExchangeLogs table:
- A clustered index on an integer IDENTITY column called RowId
- A non-clustered index on the ProtocolId (i.e., IPv4 or IPv6 represented as integers), IpHex; where the RowId is included
I have created the following indexes on the IPGeoLocation table:
- A clustered index on an integer IDENTITY column called RowId
- A non-clustered index on the ProtocolId (i.e., IPv4 or IPv6 represented as integers), StartIpHex, and EndIpHex; where the RowId is included
When searching for the IP Geolocation, I join the two datasets as follows:
SELECT COUNT(DISTINCT DE.RowKey)
FROM DE.IpGeoLocation DE
INNER JOIN FWY.ExchangeServerLogTest T
ON T.InsideProtocolKey = DE.ProtocolKey
AND T.InsideIpHex BETWEEN DE.NetworkStartIpHex AND DE.NetworkEndIpHex
Estimated Query Execution Plan: Estimated InsideIp Query Execution Plan
Actual Query Execution Plan: Waiting for query to complete
SELECT COUNT(DISTINCT DE.RowKey)
FROM DE.IpGeoLocation DE
INNER JOIN FWY.ExchangeServerLogTest T
ON T.OutsideProtocolKey = DE.ProtocolKey
AND T.OutsideIpHex BETWEEN DE.NetworkStartIpHex AND DE.NetworkEndIpHex
Estimated Execution Plan: Estimated OutsideIp Query Execution Plan
Actual Query Execution Plan: DOES NOT FINISH
Note 2: The ProtocolId must be included, otherwise there are two results for each IP lookup: one for IPv4 and one for IPv6.
This seems like a very efficient execution plan considering 95% of the cost is on an index seek and another 2% on an index scan - 97% is attributed to index work.
The log files contain both internal and external IP Address on each row. For the sample data loaded:
- The Internal IP list contains 3 DISTINCT IP addresses.
- The external IP list contains approximately 60,000 DISTINCT IP Address.
Results:
- A SELECT on the internal IP list takes about 9 minutes to complete.
- A SELECT on the external IP list was stopped after allowing it to run for 16.25 hours (overnight).
I have not partitioned either the log table or the IP GeoLocation table. This might provide a performance boost by streaming data through two separate LUNs, but I am still trying to get a hardware configuration specification from our IT Ops group (they just provisioned new servers, so I don't have that info yet).
sql-server performance query-performance index-tuning configuration
Question: Is there a better indexing strategy or query SELECT that I can use for looking up one large data set against another large data set? Or, should I look at placing the lookup dimension table in memory (all 125 GB of it)?
Server Configuration:
- The server is a virtual server running on top of VMWare, so additional hardware can be added in the background without having to reinstall the operating system
- Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64)
Aug 22 2017 17:04:49
Copyright (C) 2017 Microsoft Corporation
Standard Edition (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: ) (Hypervisor)
Note: I was previously on 2014 Enterprise - I have inquired why I was placed on Standard.- There is only one instance that is running 2 databases: mine and the DBAs
- 2 File groups, with 1 file each: PRIMARY (system tables : not-default) and SECONDARY (non-system tables : default). The SECONDARY was meant to be scalable to hold more files once more CPUs were added. When the file group was initially created the server only had 2 CPUs
- 8 GB memory
- 500 GB disk storage (ISCSI SAN)
- 4 CPUs (Intel I assume)
IIS Exchange Server log table Schema:
CREATE TABLE [FWY].[ExchangeServerLogTest](
[RowKey] [int] IDENTITY(1,1) NOT NULL,
[SourceFileName] [varchar](50) NOT NULL,
[SourceServer] [varchar](9) NOT NULL,
[SourceService] [varchar](6) NOT NULL,
[EventOccuranceTs] [datetime] NOT NULL,
[ServiceType] [varchar](50) NOT NULL,
[UserNameType] [varchar](25) NOT NULL,
[DomainId] [varchar](50) NULL,
[DomainName] [varchar](255) NULL,
[UserNameToLookup] [varchar](255) NOT NULL,
[UserAgent] [varchar](255) NULL,
[OutsideProtocolId] [varchar](10) NOT NULL,
[OutsideIp] [varchar](39) NULL,
[OutsideIpHex] [varbinary](16) NULL,
[InsideProtocolId] [varchar](10) NOT NULL,
[InsideIp] [varchar](39) NULL,
[InsideIpHex] [varbinary](16) NULL,
[DeviceId] [varchar](32) NULL,
[DeviceType] [varchar](25) NULL,
[DeviceModel] [varchar](75) NULL,
[AsOfDt] [date] NULL,
[OutsideProtocolKey] [int] NULL,
[InsideProtocolKey] [int] NULL,
CONSTRAINT [PK_ExchangeServerLogTest] PRIMARY KEY CLUSTERED
(
[RowKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]
) ON [SECONDARY]
Non-Clustered Index:
CREATE NONCLUSTERED INDEX [NCIDX_ExchangeServerLogTest_InsideOutsideProtocolKeyIpHexInclRowKey] ON [FWY].[ExchangeServerLogTest]
(
[InsideProtocolKey] ASC,
[OutsideProtocolKey] ASC,
[InsideIpHex] ASC,
[OutsideIpHex] ASC
)
INCLUDE ( [RowKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
IP GeoLocation data vendor table schema
CREATE TABLE [DE].[IpGeoLocation](
[CreateTs] [datetime] NOT NULL,
[CreateBy] [varchar](50) NOT NULL,
[CreateSequenceKey] [int] NULL,
[UpdateTs] [datetime] NULL,
[UpdateBy] [varchar](50) NULL,
[UpdateSequenceKey] [int] NULL,
[ActiveInd] [int] NOT NULL,
[RowKey] [int] IDENTITY(1,1) NOT NULL,
[VendorKey] [int] NULL,
[VendorTypeKey] [int] NULL,
[DimensionTypeKey] [int] NULL,
[ProtocolKey] [int] NULL,
[ProtocolId] [varchar](10) NOT NULL,
[EffectiveStartDate] [date] NULL,
[EffectiveEndDate] [date] NULL,
[NetworkStartIp] [varchar](39) NOT NULL,
[NetworkStartIpHex] [varbinary](16) NULL,
[NetworkEndIp] [varchar](39) NOT NULL,
[NetworkEndIpHex] [varbinary](16) NULL,
[Country] [varchar](255) NOT NULL,
[Region] [varchar](255) NOT NULL,
[City] [varchar](255) NOT NULL,
[ConnectionSpeed] [varchar](255) NOT NULL,
[ConnectionType] [varchar](255) NOT NULL,
[MetroCode] [int] NOT NULL,
[Latitude] [numeric](6, 3) NULL,
[Longitude] [numeric](6, 3) NULL,
[PostalCode] [varchar](255) NOT NULL,
[PostalExtension] [varchar](255) NOT NULL,
[CountryCode] [int] NOT NULL,
[RegionCode] [int] NOT NULL,
[CityCode] [int] NOT NULL,
[ContinentCode] [int] NOT NULL,
[TwoLetterCountry] [varchar](2) NOT NULL,
[InternalCode] [int] NOT NULL,
[AreaCodes] [varchar](255) NOT NULL,
[CountryConfidenceCode] [int] NOT NULL,
[RegionConfidenceCode] [int] NOT NULL,
[CityConfidenceCode] [int] NOT NULL,
[PostalConfidenceCode] [int] NOT NULL,
[GmtOffset] [varchar](255) NOT NULL,
[InDistance] [varchar](255) NOT NULL,
[TimeZoneName] [varchar](255) NOT NULL,
CONSTRAINT [PK_IpGeoLocation] PRIMARY KEY CLUSTERED
(
[RowKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]
) ON [SECONDARY]
Non-Clustered Index:
CREATE NONCLUSTERED INDEX [NCIDX_IpGeoLocation_ProtocolKeyNetworkStartEndIpHexIncRowKey] ON [DE].[IpGeoLocation]
(
[ProtocolKey] ASC,
[NetworkStartIpHex] ASC,
[NetworkEndIpHex] ASC
)
INCLUDE ( [RowKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
IP addresses are converted to their binary (hexadecimal) value using .NET's System.Net.IPAddress class: IPAddress.Parse(ipAddress).GetAddressBytes(). I load the data files with SSIS, and a script component returns the ProtocolId and the IP address as a byte array, which goes into SSIS as DT_BYTES and is mapped to a SQL Server VARBINARY(16) column (the byte array is displayed as a hexadecimal value).
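For readers who want to reproduce the conversion outside SSIS, here is a minimal sketch in Python rather than .NET. It assumes that the standard-library `ipaddress` module's `packed` attribute plays the same role as `GetAddressBytes()` (4 bytes for IPv4, 16 bytes for IPv6, network byte order); the helper name `ip_to_bytes` is made up for illustration and is not part of the original package.

```python
import ipaddress

def ip_to_bytes(text):
    """Return (protocol id, packed address bytes) for an IPv4 or IPv6 string.

    Hypothetical helper: stands in for the SSIS script component.
    """
    addr = ipaddress.ip_address(text)
    # addr.version is 4 or 6; addr.packed is 4 bytes for IPv4 and
    # 16 bytes for IPv6, most significant byte first.
    return addr.version, addr.packed

for text in ("192.168.1.10", "2001:db8::1"):
    version, packed = ip_to_bytes(text)
    # Show the value the way SQL Server displays a varbinary column.
    print(version, len(packed), "0x" + packed.hex().upper())
```

The point of the sketch is only that IPv4 and IPv6 values end up with different lengths in the same VARBINARY(16) column, which matters for the range comparison later in the question.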
Lookup IP Address range
I have two data sets: IIS Exchange Server IP log records and IP geolocation data provided by a third-party vendor, where each geolocation row covers a range of IP addresses. I need to look up each IP address from the log file and get its geolocation. Both data sets accommodate IPv4 and IPv6, and the IP addresses are received in string format. When I load the data, I convert each IP address into a hexadecimal value [VARBINARY(16)] so that I can look up the IP address's geolocation.
The problem is that I am loading a large number of records. Currently, the vendor provides close to 200 million IP address geolocations (i.e., the dimension lookup table). I knew from the inception that performance optimization would be required at all stages (i.e., hardware configuration, table partitioning, and indexing strategy). I have loaded one week's worth of sample log data, which is approximately 150 million records.
Note: When the log files are parsed, approximately 90% of records are ignored; we load only 10% of the records, so there is no further performance gain to be made here.
I have created the following indexes on the ExchangeLogs table:
- A clustered index on an integer IDENTITY column called RowId
- A non-clustered index on the ProtocolId (i.e., IPv4 or IPv6 represented as integers), IpHex; where the RowId is included
I have created the following indexes on the IPGeoLocation table:
- A clustered index on an integer IDENTITY column called RowId
- A non-clustered index on the ProtocolId (i.e., IPv4 or IPv6 represented as integers), StartIpHex, and EndIpHex; where the RowId is included
When searching for the IP Geolocation, I join the two datasets as follows:
SELECT COUNT(DISTINCT DE.RowKey)
FROM DE.IpGeoLocation DE
INNER JOIN FWY.ExchangeServerLogTest T
ON T.InsideProtocolKey = DE.ProtocolKey
AND T.InsideIpHex BETWEEN DE.NetworkStartIpHex AND DE.NetworkEndIpHex
Estimated Query Execution Plan: Estimated InsideIp Query Execution Plan
Actual Query Execution Plan: Waiting for query to complete
SELECT COUNT(DISTINCT DE.RowKey)
FROM DE.IpGeoLocation DE
INNER JOIN FWY.ExchangeServerLogTest T
ON T.OutsideProtocolKey = DE.ProtocolKey
AND T.OutsideIpHex BETWEEN DE.NetworkStartIpHex AND DE.NetworkEndIpHex
Estimated Execution Plan: Estimated OutsideIp Query Execution Plan
Actual Query Execution Plan: DOES NOT FINISH
Note 2: The ProtocolId must be included, otherwise there are two results for each IP lookup: one for IPv4 and one for IPv6.
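To illustrate why a protocol filter is needed, here is a small sketch in Python. It is only an analogy: it assumes byte strings are compared left to right (which is how Python compares `bytes`), so a 4-byte IPv4 value can sort inside a range of 16-byte IPv6 values. How SQL Server orders varbinary values of different lengths is not verified here, and the function names and sample addresses are made up for the example.

```python
import ipaddress

def in_range(ip, start, end):
    """Range check on raw bytes only, with no protocol filter."""
    return start <= ip <= end

def in_range_same_protocol(ip, start, end):
    """Range check that also requires matching address lengths (4 or 16 bytes).

    The length test stands in for the ProtocolKey equality predicate
    used in the join.
    """
    return len(ip) == len(start) == len(end) and start <= ip <= end

ipv4 = ipaddress.ip_address("192.168.1.10").packed  # 4 bytes: C0 A8 01 0A
v6_start = ipaddress.ip_address("c0a8::").packed    # 16 bytes: C0 A8 00 ...
v6_end = ipaddress.ip_address("c0a9::").packed      # 16 bytes: C0 A9 00 ...

print(in_range(ipv4, v6_start, v6_end))                # True
print(in_range_same_protocol(ipv4, v6_start, v6_end))  # False
```

Without the protocol check the IPv4 address falls inside an unrelated IPv6 range, which is the kind of duplicate match the note describes.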
This seems like a very efficient execution plan considering 95% of the cost is on an index seek and another 2% on an index scan - 97% is attributed to index work.
The log files contain both an internal and an external IP address on each row. For the sample data loaded:
- The internal IP list contains 3 DISTINCT IP addresses.
- The external IP list contains approximately 60,000 DISTINCT IP addresses.
Results:
- A SELECT on the internal IP list takes about 9 minutes to complete.
- A SELECT on the external IP list was stopped after allowing it to run for 16.25 hours (overnight).
I have not partitioned either the log table or the IP GeoLocation table. This might provide a performance boost by streaming data through two separate LUNs, but I am still trying to get a hardware configuration specification from our IT Ops group (they just provisioned new servers, so I don't have that info yet).
sql-server performance query-performance index-tuning configuration
asked 4 hours ago by J Weezy, edited 2 hours ago
You've included a lot of info, but you might want to check this post on how to improve it with more details to help get it answered efficiently.
– LowlyDBA
3 hours ago
@LowlyDBA I have updated the question to include the server configuration. Are there any other additions or subtractions that you recommend?
– J Weezy
3 hours ago
Schema creation scripts & actual plans via pastetheplan.com
– LowlyDBA
3 hours ago
I wonder, how do you represent IPv6 addresses as integers? SQL Server's int and bigint have 4 and 8 bytes respectively; IPv6 needs 16 bytes.
– ypercubeᵀᴹ
3 hours ago
Another (minor) possible improvement: if the possible values for the ProtocolKey columns are only two (for IPv4 / IPv6) and you could convert all those columns from int to tinyint, you would save 3 bytes per row. It won't be a huge saving, but for big tables it would help (a little). Even for the not so big: 200M rows x 3 bytes = 600 MB saved, for every index where the columns appear.
– ypercubeᵀᴹ
3 hours ago
1 Answer
First, I suggest you add two separate indexes, on
(InsideProtocolKey, InsideIpHex) INCLUDE (RowKey)
(OutsideProtocolKey, OutsideIpHex) INCLUDE (RowKey)
and try the queries again. Your 4-column index is no good for the "Outside" query, as its columns appear in the 2nd and 4th positions, and it is only slightly useful for the "Inside" query (1st and 3rd). Plus, each of these 2 indexes will be half the size (20 bytes vs 40 bytes per row).
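The 20 vs 40 bytes figure can be checked with a few lines of arithmetic. This is a rough sketch, not a sizing tool: it assumes every IP value uses the full 16 bytes of VARBINARY(16) (IPv4 values need only 4), it ignores per-row and per-page overhead and the included RowKey, and the 150 million row count is the one-week sample size mentioned in the question.

```python
# Key-column widths in bytes. These are upper bounds: varbinary(16) holds
# 16 bytes for IPv6 but only 4 for IPv4, and row overhead is ignored.
INT_BYTES = 4
IP_HEX_MAX_BYTES = 16

def key_width(column_widths):
    """Sum the widths of an index's key columns, in bytes."""
    return sum(column_widths)

def index_key_gb(rows, width):
    """Approximate key storage in decimal gigabytes."""
    return rows * width / 1_000_000_000

# (ProtocolKey, IpHex) versus (ProtocolKey, IpHex, ProtocolKey, IpHex)
two_column = key_width([INT_BYTES, IP_HEX_MAX_BYTES])
four_column = key_width([INT_BYTES, IP_HEX_MAX_BYTES,
                         INT_BYTES, IP_HEX_MAX_BYTES])

LOG_ROWS = 150_000_000  # roughly one week of sample log data

print(two_column, four_column)              # 20 40
print(index_key_gb(LOG_ROWS, two_column))   # 3.0
print(index_key_gb(LOG_ROWS, four_column))  # 6.0
```

Under these assumptions each narrow index carries about half the key bytes of the 4-column one, which is significant on a server with 8 GB of RAM.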
Second, a minor improvement. Since you only have two options for the ProtocolKey column (and its Inside/Outside variations), you could convert all of them from int (4 bytes) to tinyint (1 byte), or even to bit (1 bit), and save 3 bytes per row (or 3 + 7/8).
It won't be a huge saving, but for big tables it would help. Even for the not so big: 200M rows x 3 bytes = 600 MB saved, for every index where the columns appear. I'm not entirely sure how indexes store bit columns, but the saving should be either the same as with tinyint (600 MB) or more (up to 775 MB) for the same table size. Again, that applies to every index that uses the column.
Smaller indexes mean a smaller size on disk and, more importantly, less memory use and a better chance of staying in memory, especially on a server with as little RAM as yours.
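The savings figures above work out as follows. This is plain arithmetic in decimal megabytes; the "bit" line takes the 1/8-byte figure at face value as a best case and says nothing about how the engine actually stores the column.

```python
ROWS = 200_000_000  # approximate size of the geolocation table
INT_BYTES = 4
TINYINT_BYTES = 1
BIT_BYTES = 1 / 8   # best case: one bit, counted as an eighth of a byte

def saved_mb(rows, bytes_saved_per_row):
    """Bytes saved across all rows, in decimal megabytes."""
    return rows * bytes_saved_per_row / 1_000_000

tinyint_saving = saved_mb(ROWS, INT_BYTES - TINYINT_BYTES)
bit_saving = saved_mb(ROWS, INT_BYTES - BIT_BYTES)

print(tinyint_saving)  # 600.0
print(bit_saving)      # 775.0
```

The same saving applies once per index that contains the column, so the total grows with the number of indexes.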
Third, 8GB sounds like a very small amount of RAM these days, especially when you have tables of this size. RAM is cheap (at least until you pass the 128GB Standard/Enterprise threshold and then you have the bigger licence charge).
answered 2 hours ago by ypercubeᵀᴹ, edited 1 hour ago by Erik Darling