Most of the time we don't have enough data, or even any access to real customer data, so we have to generate some rows. To do that, I wrote a Python script that generates dummy data, and I want to share it with you.
In very simple terms, this Python script generates rows for a given HANA table: it reads the table metadata for column definitions (taking into account data types, lengths, etc.) and then generates a random value according to each column's specification. I'm using Python with the SAP HANA Python client (hdbcli), and I think it would work with the PyHDB Python library as well.
So, after you establish a valid connection, you can start executing SQL queries with the hdbcli library. I'm working with SAP HANA, express edition (HXE), and you can connect to HXE and get table metadata with something similar to this:
from hdbcli import dbapi

# connect to the HXE instance
connection = dbapi.connect(
    address='hxehost',
    port=39015,
    user='SYSTEM',
    password='yourpassword'
)
cursor = connection.cursor()

# read the column definitions for the target table from the catalog;
# placeholders avoid building the query by string concatenation
QRY = ("SELECT COLUMN_NAME, DATA_TYPE_NAME, LENGTH, SCALE "
       "FROM TABLE_COLUMNS WHERE SCHEMA_NAME = ? AND TABLE_NAME = ? "
       "ORDER BY POSITION ASC")
cursor.execute(QRY, (SCHEMA, TABLE))
coldefinition = cursor.fetchall()
If you iterate over the coldefinition variable, you get the column name, data type, length, and scale for every column of the table identified by the SCHEMA and TABLE variables.
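For example, a quick way to inspect that metadata (a minimal sketch, assuming the query above has already run):

# each metadata row is (COLUMN_NAME, DATA_TYPE_NAME, LENGTH, SCALE)
for name, data_type, length, scale in coldefinition:
    print(name, data_type, length, scale)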
I defined a set of functions that are applied to every column definition in coldefinition; this lets you get a random value depending on the column data type, and also define exceptions for specific columns. For example, if column COLUMNA has data type VARCHAR(20), you may want a random string of length 20, while for COLUMNB (also VARCHAR(20)) you may not want a string at all but integer values.
Those functions look like this:
def setVal(colDef):
    # colDef = (COLUMN_NAME, DATA_TYPE_NAME, LENGTH, SCALE)
    if colDef[1] == 'DECIMAL':
        return randomDec(colDef[2] - colDef[3], colDef[3])
    elif colDef[1] == 'NVARCHAR':
        return randomStr(colDef[2])
    elif colDef[1] == 'MATERIAL':
        # exception: pick from a fixed list instead of a purely random value
        return random.choice(materiales)
    elif colDef[1] == 'FECHAPERIODO':
        # exception: random date within a fixed range
        return randomDate(20170101, 20181231)
    else:
        return ''
randomDec and randomStr are custom functions that you can use to generate very specific values, or simply to wrap a call to the random library (or any library you need). With custom functions you can even decide which statistical distribution the random values will follow.
import random
import string

def randomDec(p, s):
    # at most p digits in total, s of them after the decimal point
    return random.randint(0, (10**p) - 1) / (10**s)

def randomStr(length):
    STR = string.ascii_uppercase
    return ''.join(random.choice(STR) for i in range(length))
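As a sketch of that last idea, here is a hypothetical variant of randomDec that draws values from a normal distribution instead of a uniform one (the default mean and standard deviation are illustrative assumptions):

def randomDecGauss(p, s, mu=None, sigma=None):
    # hypothetical variant: normally distributed values instead of uniform ones
    limit = (10**p) - 1                                  # largest raw integer that fits
    mu = limit / 2 if mu is None else mu
    sigma = limit / 6 if sigma is None else sigma
    value = min(max(random.gauss(mu, sigma), 0), limit)  # clamp into range
    return round(value) / (10**s)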
If you define an exception where the column definition is something different, like 'MATERIAL' or 'FECHAPERIODO', you can take further action. In this particular case, I didn't want random material numbers; I wanted to choose randomly from a specific list of possible values. This kind of exception ensures that common column values are shared across different tables (useful if you want to do some SAP HANA modeling work and don't want empty joins caused by purely random values in different tables).
from datetime import datetime, timedelta

def randomDate(start, end):
    # random timestamp between two YYYYMMDD integers, returned as a string
    start = datetime.strptime(str(start), '%Y%m%d')
    end = datetime.strptime(str(end), '%Y%m%d')
    return str(start + timedelta(
        seconds=random.randint(0, int((end - start).total_seconds()))
    ))
I defined my exceptions in JSON format, like this:
{
    "TABLESCHEMA": {
        "TABLEA": {
            "PLANT": ["PLANTA"],
            "MATERIAL": ["MATERIAL"],
            "CALDAY": ["FECHAPERIODO"],
            "CALMONTH": ["FECHACORTA"]
        },
        "TABLEB": {
            "PLANT": ["PLANTA"],
            "MATERIAL": ["MATERIAL"],
            "CALMONTH": ["FECHACORTA"]
        }
    }
}
Every time this Python script finds table TABLEA in schema TABLESCHEMA, it fetches the table's column definitions. If it finds a column name listed in the exceptions file, it applies the specified "definition"; in any other case it keeps the HANA data type.
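The lookup itself is straightforward; a minimal sketch (the variable names here are assumptions, not the script's actual code) could look like this:

import json

# load the exceptions file and pick the overrides for this schema/table, if any
with open('exceptions.json') as f:
    exceptions = json.load(f)
overrides = exceptions.get(SCHEMA, {}).get(TABLE, {})

# replace the HANA data type with the exception "definition" before calling setVal
row = [setVal((name, overrides.get(name, [dtype])[0], length, scale))
       for name, dtype, length, scale in coldefinition]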
Finally, you have two options for using the data this script generates: insert it directly into the HANA table, or write a CSV file that can be uploaded to a HANA table later. Which one is better depends on how many rows you want to generate and how fast you need the data. If you only generate CSV data, the process ends relatively quickly, and you can then import the generated data with the IMPORT SQL statement; if you insert it directly into the HANA table, it takes more time, but you avoid the additional import step. You may face one restriction or the other depending on whether you work on an on-premise installation or a cloud instance.
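Both paths are easy to sketch. The helpers below are hypothetical (not the script's actual code) and assume rows is a list of generated row tuples, with SCHEMA and TABLE defined as before:

import csv

def write_csv(rows, path):
    # CSV path: fast to generate; import afterwards with HANA's IMPORT statement
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

def insert_rows(connection, rows):
    # direct-insert path: slower, but skips the separate import step
    placeholders = ', '.join(['?'] * len(rows[0]))
    cursor = connection.cursor()
    cursor.executemany(
        'INSERT INTO "%s"."%s" VALUES (%s)' % (SCHEMA, TABLE, placeholders),
        rows
    )
    connection.commit()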
After you extract (or clone) the script into a directory, you can run it with these parameters:
python genDummy.py -s SCHEMA -t TABLEA -f exceptions.json -r 1000 -d 20170101 -o
where:
◈ -s: schema name where the table is located
◈ -t: name of the table for which data will be generated
◈ -f: file with the exceptions definition
◈ -r: number of rows the script will generate
◈ -d: additional parameter that (in this particular case) was a date used to generate a specific value
◈ -o: if present, a CSV file is generated; if absent, data is inserted directly into the HANA table
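For reference, parsing those flags inside the script could look like this minimal argparse sketch (the flag names match the call above; the destinations and defaults are assumptions):

import argparse

parser = argparse.ArgumentParser(description='Generate dummy rows for a HANA table')
parser.add_argument('-s', dest='schema', required=True, help='schema name')
parser.add_argument('-t', dest='table', required=True, help='table name')
parser.add_argument('-f', dest='exceptions', help='JSON file with exception definitions')
parser.add_argument('-r', dest='rows', type=int, default=1000, help='number of rows to generate')
parser.add_argument('-d', dest='date', help='date (YYYYMMDD) used for specific values')
parser.add_argument('-o', dest='to_csv', action='store_true',
                    help='write a CSV file instead of inserting into HANA')
args = parser.parse_args()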