
Archive for the ‘Uncategorized’ Category

The benefits of running databases in AWS are compelling, but how do you get your data there? In this session, we will explore, at a very high level, how to use the AWS Database Migration Service (DMS) to migrate on-premises SQL Server tables to DynamoDB in AWS.

I will write up a follow-up blog post focusing on the nitty-gritty details of this migration. Until then, happy cloud surfing 🙂

 

 

 

 

 


Read Full Post »

In the big-data ecosystem, it is often necessary to move data from the Hadoop file system to external storage such as S3, or on to a data warehouse for further analytics. In this article, I will walk through the steps needed to move data from HDFS to S3, with some tips and gotchas along the way. In a later article, I will write about moving the same data from S3 to Redshift, which is mostly straightforward as long as the data has been prepared correctly for data warehouse ingestion.

 


HDFS Source Directory

hdfs://hadoopcluster.com:9000/data/hive/warehouse/testdb.db/mapping_analytics_data

HDFS Source Table (optional)

testdb.mapping_analytics_data

HDFS (State directory)

hdfs://hadoopcluster.com:9000/data/test/mapping_analytics_historical.db

S3 Bucket Location

s3://hdfs_bucket/mapping-data/


Step 1: Data preparation in HDFS

 

Data preparation at the source is required to make sure that there are no issues when the data is eventually loaded into Redshift tables. This step is not crucial if you plan to keep the data only in S3 with no goal of copying it to a data warehouse. The reason is that Redshift (or any RDBMS, for that matter) can be very picky about the format of the data, so this script should get the data into a state that Redshift (or any RDBMS) is happy with. Also, once the data is in the storage container, making changes (especially schema-related ones) is almost always harder than making them while the data is still on HDFS. This is also the time to architect and design the data warehouse tables that will receive the data.

Most of the issues that I faced during the S3 to Redshift load were related to null values and, sometimes, to data type mismatches caused by a special character. To transform the data, I created a new directory in HDFS and used the INSERT OVERWRITE DIRECTORY statement in Hive to copy data from the existing location (or table) to the new location. If you would rather move the data to a Hive table instead of a directory, you can either use INSERT OVERWRITE TABLE or create an external table over the new data directory. See "Writing data into the filesystem from queries" in the Hive documentation.

Here are some of the configurations that I have used to make the process easier.

  • Used Hive on Spark, that is, Apache Spark as Hive's execution engine, for faster execution. Spark must be installed on your cluster for this to work, but using it is optional. See the Hive on Spark documentation for setup details.
  • Enabled Gzip compression to speed up the network copy and save space in the S3 bucket.
  • Replaced NULL values with blank strings or other literals using the nvl function.
  • Removed hyphens from the date column using the regexp_replace function.
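To make the last two rules concrete, here is a minimal local sketch of what they do to a single row. It is only an illustration in awk, not the Hive job itself; the function name clean_row and the sample values are made up for this example, and an empty string stands in for NULL.

```shell
#!/usr/bin/env bash
# Illustration only: mimic the nvl / regexp_replace cleanup on one row.
# clean_row is a hypothetical helper; "" stands in for a NULL column.
clean_row() {
  # $1 = date, $2 = source, $3 = dimension, $4 = value
  awk -v d="$1" -v s="$2" -v m="$3" -v v="$4" 'BEGIN {
    gsub(/-/, "", d)          # remove hyphens from the date column
    if (v == "") v = "0"      # a missing value becomes 0
    # missing source or dimension stays as an empty string
    printf "%s\t%s\t%s\t%s\n", d, s, m, v
  }'
}

# Sample row with a missing dimension and a missing value
clean_row "2017-11-02" "web" "" ""
```

The output is one tab-separated line in which the date has no hyphens and the empty value has become 0, which is the shape a tab-delimited load expects.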

 

-- ## Transformation and insert script within HDFS ## --
-- Enable compression and use Spark as the execution engine
set hive.execution.engine=spark;
set mapred.reduce.tasks=1;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
-- ORC-only setting (ZLIB is the gzip-style ORC codec), it does not affect the TEXTFILE output below
set hive.exec.orc.default.compress=ZLIB;

-- Refresh the partition metadata of the source table
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE testdb.mapping_analytics_data;

-- Write the cleaned rows to the target directory
INSERT OVERWRITE DIRECTORY "hdfs://hadoopcluster.com:9000/data/TEST/mapping_analytics_historical.db/dt=${hiveconf:DATE_PARTITION}"
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY "\t"
  STORED AS TEXTFILE
SELECT
    -- strip hyphens from the date and replace NULLs with literals
    NVL(regexp_replace(`date`, '-', ''), ""),
    NVL(source, ""),
    NVL(dimension, ""),
    CAST(NVL(value, "0") AS bigint)
FROM
    testdb.mapping_analytics_data
WHERE
    rec_date = "${hiveconf:DATE_PARTITION}";

 

Here is the command that invokes the above script from the command line. The -f option names the insert script, and -hiveconf passes the date parameter. Run it from a node in the Hadoop cluster that can access both the old and the new HDFS locations. See Hive Batch Mode Commands.

 

/usr/bin/hive -hiveconf "DATE_PARTITION=2017-11-02" \
  -f "$HIVE_SCRIPTS/stage_HDFS_Insert.sql" 2>&1 \
  | tee "${LOG_FILE_PREFIX}-stage_hdfstoS3.log"
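If several date partitions need staging, the same command can be wrapped in a small loop. The following is a sketch under assumptions: the function name stage_partition, the HIVE_BIN and DRY_RUN variables, and the per-date log file name are hypothetical, and DRY_RUN defaults to 1 so that the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: stage several date partitions, one Hive run per date.
# HIVE_BIN, DRY_RUN and stage_partition are assumed names for this example.
HIVE_BIN="${HIVE_BIN:-/usr/bin/hive}"
HIVE_SCRIPTS="${HIVE_SCRIPTS:-.}"
LOG_FILE_PREFIX="${LOG_FILE_PREFIX:-/tmp/stage}"
DRY_RUN="${DRY_RUN:-1}"

stage_partition() {
  local dt="$1"
  local log="${LOG_FILE_PREFIX}-stage_hdfstoS3-${dt}.log"
  if [ "$DRY_RUN" = "1" ]; then
    # Only show what would be executed
    echo "$HIVE_BIN -hiveconf DATE_PARTITION=$dt -f $HIVE_SCRIPTS/stage_HDFS_Insert.sql"
  else
    # Run the insert script for this date and keep a per-date log
    "$HIVE_BIN" -hiveconf "DATE_PARTITION=$dt" \
      -f "$HIVE_SCRIPTS/stage_HDFS_Insert.sql" 2>&1 | tee "$log"
  fi
}

for dt in 2017-11-01 2017-11-02; do
  stage_partition "$dt"
done
```

Set DRY_RUN=0 once the printed commands look right for your cluster.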

Step 2: HDFS to S3 Migration

 

Finally, we will move the cleansed data to S3 using the DistCp command, which is often used in data movement workflows in the Hadoop ecosystem. It provides a distributed copy capability built on top of the MapReduce framework. The command below copies the data from the HDFS location to the S3 bucket and writes its output to a log file.

 

## Copy one date partition from HDFS to S3 and capture the output in a log file
/opt/hadoop/bin/hadoop distcp \
  "hdfs://hadoopcluster.com:9000/data/TEST/mapping_analytics_historical.db/dt=2017-11-02/*" \
  "s3a://$AWS_ACCESS_KEY:$AWS_SECRET_KEY@hdfs_bucket/mapping-data/dt=2017-11-02" \
  > "$LOG_DIR/mapping-log-$DATE_PARTITION.log" 2>&1
##
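As an alternative, the S3A credentials can be passed as Hadoop properties (fs.s3a.access.key and fs.s3a.secret.key) rather than inside the s3a:// URL, which, as far as I know, the Hadoop S3A documentation discourages. The sketch below assumes the same paths as above; the names distcp_partition, HADOOP_BIN and DRY_RUN are hypothetical, and by default the command is only printed with the secrets masked.

```shell
#!/usr/bin/env bash
# Sketch: DistCp with S3A credentials supplied as -D properties.
# distcp_partition, HADOOP_BIN and DRY_RUN are assumed names for this example.
HADOOP_BIN="${HADOOP_BIN:-/opt/hadoop/bin/hadoop}"
DRY_RUN="${DRY_RUN:-1}"
SRC_ROOT="hdfs://hadoopcluster.com:9000/data/TEST/mapping_analytics_historical.db"
DST_ROOT="s3a://hdfs_bucket/mapping-data"

distcp_partition() {
  local dt="$1"
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command with the secrets masked
    echo "$HADOOP_BIN distcp -Dfs.s3a.access.key=*** -Dfs.s3a.secret.key=*** $SRC_ROOT/dt=$dt/* $DST_ROOT/dt=$dt"
  else
    # The quoted glob is expanded by DistCp against HDFS, not by the local shell
    "$HADOOP_BIN" distcp \
      -Dfs.s3a.access.key="$AWS_ACCESS_KEY" \
      -Dfs.s3a.secret.key="$AWS_SECRET_KEY" \
      "$SRC_ROOT/dt=$dt/*" "$DST_ROOT/dt=$dt"
  fi
}

distcp_partition 2017-11-02
```

Command-line properties are still visible in the process list, so a Hadoop credential provider or an instance role may be a better fit where one is available.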

 

Note: S3DistCp, available on Amazon EMR, is an extension to DistCp that is optimized to work with S3 and adds several useful features in addition to moving data between HDFS and S3.

 

[Screenshot: listing of the output files in the S3 bucket]

The screenshot above shows that I have multiple files in the S3 bucket. Although it is not a requirement, it is usually a best practice to have multiple files in distributed systems. In my case, the Spark execution engine split the output into multiple files automatically because of Spark's distributed way of computation.

If you use Hive with MapReduce only and want to move the data to Redshift, it is a best practice to split the files before loading them into Redshift tables, because the Redshift COPY command loads data in parallel from multiple files using its massively parallel processing (MPP) architecture. If you load data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. For more on this, see Loading data from Amazon S3.
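One way to split a single large export before uploading it is the standard split utility followed by gzip. This is a minimal sketch, assuming a line-oriented, tab-delimited file; the function name split_for_copy, the file names, and the part size are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: split one large delimited file into several gzip parts on line
# boundaries so that a Redshift COPY can read the parts in parallel.
# split_for_copy and its arguments are hypothetical names for this example.
split_for_copy() {
  local input="$1" lines_per_part="$2" out_prefix="$3"
  # -l splits on whole lines, so no row is cut in half
  split -l "$lines_per_part" "$input" "$out_prefix"
  # Compress every part (COPY can read gzip files when its GZIP option is set)
  gzip -f "$out_prefix"*
}

# Demo on a throwaway 10-row file, 4 rows per part
workdir="$(mktemp -d)"
seq 1 10 > "$workdir/export.tsv"
split_for_copy "$workdir/export.tsv" 4 "$workdir/part_"
ls "$workdir"
```

The parts share a common prefix, so they can be uploaded under one S3 key prefix and loaded with a single COPY that points at that prefix.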

Read Full Post »

We are happy to announce the birth of DataGinger. Starting today, we have officially rebranded ourselves as DataGinger.com (previously SQLServerZest.com). The main motive for this change is to expand our blogging topics beyond SQL Server and widen our scope to include all data and supporting technologies.

Enjoy learning!

-Your Friends at DataGinger

Read Full Post »

Starting with SQL Server 2012, AlwaysOn Availability Groups provide group-level high availability for a set of databases across multiple secondaries, known as 'replicas'. The secondary replicas can allow direct read-only querying, or can accept only connections that specify 'ReadOnly' as their Application Intent. A new feature called read-only routing redirects such connections to a secondary and can be leveraged to scale out reporting workloads. However, in SQL Server 2012 and 2014 this redirection only uses the first available secondary replica in the priority list, and by design all read-only connections are routed to that one replica. This keeps the other secondary replicas from participating in load distribution and so reduces the load-balancing capability. This article covers the configuration and testing of read-only routing, along with a custom SQL Agent job that attempts to create an improved load-balancing effect.

Read-only routing refers to the ability of SQL Server to route incoming read-intent connection requests, which are directed to an availability group listener, to an available readable secondary replica. One of the pre-requisites to support read-only routing is that the availability replicas must be enabled for read access.

Tip: Use the script below to check whether read-only routing is already configured on your server:

SELECT ag.name as "Availability Group", ar.replica_server_name as "When Primary Replica Is",
rl.routing_priority as "Routing Priority", ar2.replica_server_name as "RO Routed To",
ar.secondary_role_allow_connections_desc, ar2.read_only_routing_url
FROM sys.availability_read_only_routing_lists rl
             INNER JOIN sys.availability_replicas ar on rl.replica_id = ar.replica_id
             INNER JOIN sys.availability_replicas ar2 on rl.read_only_replica_id = ar2.replica_id
             INNER JOIN sys.availability_groups ag on ar.group_id = ag.group_id 
ORDER BY ag.name, ar.replica_server_name, rl.routing_priority 

To make it easy to understand, in this demo, we will use the below terminology:

  • Availability group named AG
  • Listener named AGLISTEN
  • Replicas SQL01A (primary) and SQL01B (secondary)

NOTE: Read-only routing supports replicas whose ALLOW_CONNECTIONS property is set to READ_ONLY or ALL.

Once the secondaries are set to readable (Read-Intent only/Yes), the following three steps are required to configure read-only routing:

  1. Define a read-only routing URL
  2. Define a read-only routing List
  3. Update the client’s connection string to specify Application Intent connection property as ‘read-only’

Let's take a look at the above steps in detail.

1.   Configure Read-Only routing URL

A read_only_routing_url is the entry point an application uses to connect to a readable secondary. It contains the system address and the port number that identify the replica when it is acting as a readable secondary. This is similar to the endpoint URL we specify when configuring database mirroring. For each readable secondary replica that is to support read-only routing, you need to specify this routing URL.

For example, define a URL for SQL01B so that when SQL01B is in the secondary role, it can accept read-only connections. The same is done for SQL01A so that routing keeps working after a failover.

ALTER AVAILABILITY GROUP AG MODIFY REPLICA ON N'SQL01A' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL01A:1433'))
ALTER AVAILABILITY GROUP AG MODIFY REPLICA ON N'SQL01B' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL01B:1433'))

Tip: Use the routing URL generation script to produce the URL for each secondary replica, and use the result in the statements above.


 

2.   Configure Read-Only routing List

For each replica that may act as primary, we need to define the secondary replicas that will serve as routing targets for read-only workloads. When that replica is the primary, read-intent connections are redirected to the replicas in its read-only routing list. For example, when SQL01A is in the primary role, the routing list below starts with SQL01B, so read-only connection requests are routed there first; if it is not available or not synchronizing (only in SQL Server 2012), connections go to the next server in the list.

ALTER AVAILABILITY GROUP AG MODIFY REPLICA ON N'SQL01A' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL01B', 'SQL01A')));
GO
ALTER AVAILABILITY GROUP AG MODIFY REPLICA ON N'SQL01B' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL01A', 'SQL01B')));

Tip: Alternatively, to automate the above process, you can use a script that dynamically generates the statements required for these tasks.

Unfortunately, there is no graphical user interface in SSMS for these tasks. The read-only routing URL and the routing list can be configured only through Transact-SQL or PowerShell.

NOTE: As a best practice, add the primary replica's name at the end of the routing list so that read-intent connections can still be served in the rare event that none of the secondary replicas is available.
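The routing-list statements follow a simple pattern: for each replica, list every other replica first and the replica itself last. Here is a small sketch that prints the statements for any number of replicas; the function name routing_list_sql is hypothetical, and the output is meant to be reviewed before it is run against a server.

```shell
#!/usr/bin/env bash
# Sketch: print one READ_ONLY_ROUTING_LIST statement per replica, with the
# replica itself (the would-be primary) placed last in its own list.
# routing_list_sql is a hypothetical helper name.
routing_list_sql() {
  local ag="$1"; shift
  local primary other list
  for primary in "$@"; do
    list=""
    for other in "$@"; do
      [ "$other" = "$primary" ] && continue
      list="${list}'${other}', "
    done
    list="${list}'${primary}'"
    printf "ALTER AVAILABILITY GROUP [%s] MODIFY REPLICA ON N'%s' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (%s)));\n" \
      "$ag" "$primary" "$list"
  done
}

# Availability group and replica names from this demo
routing_list_sql AG SQL01A SQL01B
```

With the two demo replicas this prints the same two statements shown in step 2; adding more replica names extends each list accordingly.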

3.   Update client connection string

Read-only clients must direct their connection requests to the availability group listener, and their connection strings must specify the application intent as "read-only". That is, they must be read-intent connection requests. An example connection string is shown below:

Server=tcp:aglisten,1433;Database=agdb1;Integrated Security=SSPI;ApplicationIntent=ReadOnly;MultiSubnetFailover=True

Before making client-side changes, you can confirm the newly configured read-only routing with sqlcmd by specifying the application intent option (-K), as shown below:

sqlcmd -S AGLISTEN -E -d AGDB1 -K ReadOnly
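To see which replica actually served the read-intent connection, the same sqlcmd call can return @@SERVERNAME. The following is a sketch: the function name which_replica and the SQLCMD_BIN and DRY_RUN variables are assumptions, and with the default DRY_RUN=1 the command is only printed.

```shell
#!/usr/bin/env bash
# Sketch: ask the listener which replica served a read-intent connection.
# which_replica, SQLCMD_BIN and DRY_RUN are hypothetical names.
SQLCMD_BIN="${SQLCMD_BIN:-sqlcmd}"
DRY_RUN="${DRY_RUN:-1}"

which_replica() {
  local listener="$1" database="$2"
  local query="SET NOCOUNT ON; SELECT @@SERVERNAME;"
  if [ "$DRY_RUN" = "1" ]; then
    # Only show the command that would run
    echo "$SQLCMD_BIN -S $listener -E -d $database -K ReadOnly -h -1 -W -Q \"$query\""
  else
    # -h -1 hides the column header, -W trims trailing spaces, -Q runs and exits
    "$SQLCMD_BIN" -S "$listener" -E -d "$database" -K ReadOnly -h -1 -W -Q "$query"
  fi
}

which_replica AGLISTEN AGDB1
```

With routing configured as in this demo, the expectation is that the call returns the secondary's name; running sqlcmd without -K connects to the primary.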


Load Balancing using Read-Only Routing List

The read-only routing introduced in SQL Server 2012 redirects read queries to the secondary replicas, offloading them from the primary replica. However, this redirection only ever uses the first available secondary replica in the routing list that we define, because the primary replica traverses the list strictly in order and looks for the first replica that can serve the connection request. Once found, all subsequent read-only connections are routed to it. For example, in an architecture with multiple secondaries, all read-intent queries hit only the first secondary replica in the list, while the other secondaries do not share the load. This limits the load-balancing capability across the secondary replicas.

To overcome this, here is a workaround that modifies the read-only routing list periodically (every 30 seconds in this case) so that read-intent queries reach all of the replicas over time, creating a load-balancing effect. This applies only to SQL Server 2012 and 2014; starting with SQL Server 2016, Microsoft changed the game by introducing native load-balancing capabilities for read-only routing.

-- Rotate the read-only routing lists every 30 seconds (SQL Server 2012/2014 workaround)
-- AG is the availability group name used in this demo
WHILE 1 = 1
BEGIN
    -- Act only on the instance that currently holds the PRIMARY role
    IF (
        SELECT ars.role_desc
        FROM sys.availability_replicas ar
            JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
        WHERE ar.replica_server_name = @@SERVERNAME
       ) = 'PRIMARY'
       AND (SELECT COUNT(*) FROM sys.availability_read_only_routing_lists) > 1
    BEGIN
        -- First ordering of the routing lists
        ALTER AVAILABILITY GROUP [AG]
        MODIFY REPLICA ON N'SQL01A' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL01B','SQL01A')));
        ALTER AVAILABILITY GROUP [AG]
        MODIFY REPLICA ON N'SQL01B' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL01A','SQL01B')));

        WAITFOR DELAY '00:00:30';

        -- Swapped ordering, applied 30 seconds later
        ALTER AVAILABILITY GROUP [AG]
        MODIFY REPLICA ON N'SQL01A' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL01A','SQL01B')));
        ALTER AVAILABILITY GROUP [AG]
        MODIFY REPLICA ON N'SQL01B' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL01B','SQL01A')));
    END
    WAITFOR DELAY '00:00:30';
END

Note: You can add further replicas to the lists, depending on the number of secondary replicas configured in your read-only routing.

To complete the procedure, first run the code above in a new query window in SSMS for testing purposes. Once it is verified, use the code to create a SQL Agent job on every replica. The job needs to run continuously on each replica, but the code only makes changes on the instance that is in the PRIMARY role.

To verify that the read-only routing is rotating correctly, run the script below:

SELECT ag.name as "Availability Group", ar.replica_server_name as "When Primary Replica Is",
rl.routing_priority as "Routing Priority", ar2.replica_server_name as "RO Routed To"
FROM sys.availability_read_only_routing_lists rl
    INNER JOIN sys.availability_replicas ar on rl.replica_id = ar.replica_id
    INNER JOIN sys.availability_replicas ar2 on rl.read_only_replica_id = ar2.replica_id
    INNER JOIN sys.availability_groups ag on ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name, rl.routing_priority


After 30 seconds, notice that the “RO Routed To” column alternates among the available secondary replicas.


As the rotation shows, this code modifies the read-only routing list so that a different replica periodically comes into play to serve read-intent connections, essentially creating a load-balancing effect. Load balancing with this technique gets more use out of the server hardware that hosts the secondary databases and gives reporting applications better performance and throughput, especially for long and resource-intensive queries. Please note that this algorithm is limited, but it serves the purpose quite effectively. A similar but much more robust algorithm has been built into the native SQL engine starting with SQL Server 2016.
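For comparison, SQL Server 2016 and later express load balancing directly in the routing list: replicas wrapped in an inner pair of parentheses form a load-balanced set. The sketch below prints such a statement. It assumes a third replica, SQL01C, which is not part of this demo, and the function name balanced_routing_sql is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a SQL Server 2016+ routing list in which the secondaries form
# one load-balanced set (inner parentheses) and the primary comes last.
# balanced_routing_sql and the replica SQL01C are assumptions for illustration.
balanced_routing_sql() {
  local ag="$1" primary="$2"; shift 2
  local group="" r
  for r in "$@"; do
    # Join the secondaries as 'A','B',...
    group="${group:+$group,}'$r'"
  done
  printf "ALTER AVAILABILITY GROUP [%s] MODIFY REPLICA ON N'%s' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ((%s),'%s')));\n" \
    "$ag" "$primary" "$group" "$primary"
}

balanced_routing_sql AG SQL01A SQL01B SQL01C
```

With this form the engine spreads read-intent connections across SQL01B and SQL01C on its own, so the rotating Agent job is not needed on those versions.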

Read Full Post »

If you are familiar with read-only routing, the feature introduced with AlwaysOn in SQL Server 2012, keep this script handy: it generates the URLs used in the read-only routing configuration so that you do not have to type them yourself, which limits errors.

Connect to each replica in your AlwaysOn Availability Group and run the script below to get the read_only_routing_url for that replica.
Then copy the URL from the result set and use it when setting up read_only_routing_url.

PRINT 'Read-only-routing url script v.2012.1.24.1'
 
PRINT 'This SQL Server instance version is [' + cast(serverproperty('ProductVersion') as varchar(256)) + ']'
 
IF (ServerProperty('IsClustered') = 1) 
BEGIN
    PRINT 'This SQL Server instance is a clustered SQL Server instance.'
END
ELSE
BEGIN
    PRINT 'This SQL Server instance is a standard (not clustered) SQL Server instance.'   
END
 
IF (ServerProperty('IsHadrEnabled') = 1) 
BEGIN
    PRINT 'This SQL Server instance is enabled for AlwaysOn.'
END
ELSE
BEGIN
    PRINT 'This SQL Server instance is NOT enabled for AlwaysOn.'
END
 
-- Detect SQL Azure instance. 
DECLARE @is_sql_azure bit
SET @is_sql_azure = 0
 
BEGIN try 
    SET @is_sql_azure = 1 
    EXEC('DECLARE @i int SET @i = sql_connection_mode()') 
    PRINT 'This SQL Server instance is a Sql Azure instance.'
END try 
BEGIN catch 
    SET @is_sql_azure = 0 
    PRINT 'This SQL Server instance is NOT a Sql Azure instance.'
END catch
 
-- Check that this is SQL 11 or later, otherwise fail fast. 
IF (@@microsoftversion / 0x01000000 < 11 or @is_sql_azure > 0) 
BEGIN
    PRINT 'This SQL Server instance does not support read-only routing, exiting script.'
END
ELSE
BEGIN -- IF server supports read-only routing
 
    -- Fetch the dedicated admin connection (dac) port. 
    -- Normally it's always port 1434, but to be safe here we fetch it from the instance. 
    -- We use this later to exclude the admin port from read_only_routing_url. 
    DECLARE @dac_port int
    DECLARE @reg_value varchar(255) 
    EXEC xp_instance_regread 
        N'HKEY_LOCAL_MACHINE', 
        N'SOFTWARE\Microsoft\Microsoft SQL Server\\MSSQLServer\SuperSocketNetLib\AdminConnection\Tcp', 
        N'TcpDynamicPorts', 
        @reg_value output
 
    SET @dac_port = cast(@reg_value as int)
 
    PRINT 'This SQL Server instance DAC (dedicated admin) port is ' + cast(@dac_port as varchar(255)) 
    IF (@dac_port = 0) 
    BEGIN 
        PRINT 'Note a DAC port of zero means the dedicated admin port is not enabled.' 
    END
 
    -- Fetch ListenOnAllIPs value. 
    -- If set to 1, this means the instance is listening to all IP addresses. 
    -- If set to 0, this means the instance is listening to specific IP addresses. 
    DECLARE @listen_all int 
    EXEC xp_instance_regread 
        N'HKEY_LOCAL_MACHINE', 
        N'SOFTWARE\Microsoft\Microsoft SQL Server\\MSSQLServer\SuperSocketNetLib\Tcp', 
        N'ListenOnAllIPs', 
        @listen_all output
 
    IF (@listen_all = 1) 
    BEGIN 
        PRINT 'This SQL Server instance is listening to all IP addresses (default mode).' 
    END 
    ELSE 
    BEGIN 
        PRINT 'This SQL Server instance is listening to specific IP addresses (ListenOnAllIPs is disabled).' 
    END
 
    -- Check for dynamic port configuration, not recommended with read-only routing. 
    DECLARE @tcp_dynamic_ports varchar(255) 
    EXEC xp_instance_regread 
        N'HKEY_LOCAL_MACHINE', 
        N'SOFTWARE\Microsoft\Microsoft SQL Server\\MSSQLServer\SuperSocketNetLib\Tcp\IPAll', 
        N'TcpDynamicPorts', 
        @tcp_dynamic_ports output
 
    IF (@tcp_dynamic_ports = '0') 
    BEGIN 
        PRINT 'This SQL Server instance is listening on a dynamic tcp port, this is NOT A RECOMMENDED CONFIGURATION when using read-only routing, because the instance port can change each time the instance is restarted.' 
    END 
 ELSE 
    BEGIN 
        PRINT 'This SQL Server instance is listening on fixed tcp port(s) (it is not configured for dynamic ports), this is a recommended configuration when using read-only routing.' 
    END
 
    -- Calculate the server domain and instance FQDN. 
    -- We use @server_domain later to build the FQDN to the clustered instance. 
    DECLARE @instance_fqdn varchar(255) 
    DECLARE @server_domain varchar(255)
 
    -- Get the instance FQDN using the xp_getnetname API 
    -- Note all cluster nodes must be in same domain, so this works for calculating cluster FQDN. 
    SET @instance_fqdn = '' 
    EXEC xp_getnetname @instance_fqdn output, 1 
 
    -- Remove embedded null character at end if found. 
    DECLARE @terminator int 
    SET @terminator = charindex(char(0), @instance_fqdn) - 1 
    IF (@terminator > 0) 
    BEGIN 
        SET @instance_fqdn = substring(@instance_fqdn, 1, @terminator) 
    END

    -- Build @server_domain using @instance_fqdn. 
    SET @server_domain = @instance_fqdn

    -- Remove the leading host name to extract the domain name. 
    SET @terminator = charindex('.', @server_domain) 
    IF (@terminator > 0) 
    BEGIN 
        SET @server_domain = substring(@server_domain, @terminator+1, datalength(@server_domain)) 
    END 
    PRINT 'This SQL Server instance resides in domain ''' +  @server_domain + ''''
 
    IF (ServerProperty('IsClustered') = 1) 
    BEGIN 
        -- Fetch machine name, which for a clustered SQL instance returns the network name of the virtual server. 
        -- Append @server_domain to build the FQDN. 
        SET @instance_fqdn = cast(serverproperty('MachineName') as varchar(255)) + '.' + @server_domain 
    END
 
    DECLARE @ror_url varchar(255) 
    DECLARE @instance_port int
 
    SET @ror_url = '' 
    -- Get first available port for instance. 
    SELECT TOP 1    -- Select first matching port 
    @instance_port = port 
    FROM sys.dm_tcp_listener_states 
    WHERE type=0 -- Type 0 = TSQL (to avoid mirroring endpoint) 
    AND state=0    --  State 0 is online    
    AND port <> @dac_port -- Avoid DAC port (admin port) 
    AND 
    -- Avoid availability group listeners 
    ip_address not in (SELECT ip_address FROM sys.availability_group_listener_ip_addresses agls) 
    GROUP BY port        
    ORDER BY port asc  -- Pick first port in ascending order
 
    -- Check if there are multiple ports and warn if this is the case. 
    DECLARE @list_of_ports varchar(max) 
    SET @list_of_ports = ''
 
    SELECT 
    @list_of_ports = @list_of_ports + 
        case datalength(@list_of_ports) 
        when 0 then cast(port as varchar(max)) 
        ELSE ',' +  cast(port as varchar(max)) 
        END 
    FROM sys.dm_tcp_listener_states 
    WHERE type=0    --     Type 0 = TSQL (to avoid mirroring endpoint) 
    AND  state=0    --  State 0 is online    
    AND  port <> @dac_port -- Avoid DAC port (admin port) 
    AND 
    -- Avoid availability group listeners 
    ip_address not in (SELECT ip_address FROM sys.availability_group_listener_ip_addresses agls) 
    GROUP BY port        
    ORDER BY port asc
 
    PRINT 'This SQL Server instance FQDN (Fully Qualified Domain Name) is ''' + @instance_fqdn + '''' 
    PRINT 'This SQL Server instance port is ' + cast(@instance_port as varchar(10))
 
    SET @ror_url = 'tcp://' + @instance_fqdn + ':' + cast(@instance_port as varchar(10))
 
    PRINT '****************************************************************************************************************' 
    PRINT 'The read_only_routing_url for this SQL Server instance is ''' + @ror_url + '''' 
    PRINT '****************************************************************************************************************'
 
    -- If there is more than one instance port (unusual) list them out just in case. 
    IF (charindex(',', @list_of_ports) > 0) 
    BEGIN 
        PRINT 'Note there is more than one instance port, the list of available instance ports for read_only_routing_url is (' + @list_of_ports + ')' 
        PRINT 'The above URL just uses the first port in the list, but you can use any of these available ports.' 
    END
END
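Before pasting a generated URL into ALTER AVAILABILITY GROUP, a quick format check can catch copy-and-paste slips. This is a minimal sketch that only checks the tcp://host:port shape and the port range, not whether the host is reachable; the function name is_routing_url and the example.com host are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the shape of a read_only_routing_url.
# is_routing_url is a hypothetical helper; it does not contact the server.
is_routing_url() {
  local url="$1" port
  # Expect tcp://<host>:<port> with a numeric port of 1 to 5 digits
  printf '%s\n' "$url" | grep -Eq '^[Tt][Cc][Pp]://[A-Za-z0-9._-]+:[0-9]{1,5}$' || return 1
  # Everything after the last colon is the port
  port="${url##*:}"
  [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}

for url in 'tcp://SQL01B.example.com:1433' 'SQL01B:1433' 'tcp://SQL01B:0'; do
  if is_routing_url "$url"; then
    echo "ok   $url"
  else
    echo "bad  $url"
  fi
done
```

Only the first sample passes; the second lacks the tcp:// prefix and the third uses port 0.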

Read Full Post »