In the business intelligence world there are times we may need to work with unstructured lists and files. For example we may have to load survey questions, and answers and respondents answers. But the thing is all surveys have different questions, respondents and answers and we don’t want to update our ETL package every time there is a new survey, a question is added/ removed from the existing survey or there are new respondents. Because of that we can’t just simply create a template that we normally would in a data flow and expect to handle all future changes.
Now that we know there are times we can’t avoid using scripts in our ETL package. The most common unstructured data source usually comes from MS Excel and there is a question what library I should use to connect to Excel in a Server Environment? The answer might not a simple statement and depends on the environment you are working on.
The first library we can use is Microsoft.Office.Interop.Excel. Even though it is very slow this library makes Excel feel like as if you are working on a database source. It is a very rich library to work with Excel especially when someone has lots of looping to do. Well there is always “but” after each statement. If you have Excel on you DEV machine and everything might be running smoothly and you may download the library directly from Microsoft and deploy your package on the server. You would expect everything should be fine but you are most like likely to see this Error Message. System.IO.FileNotFoundException: Could not load file or assembly ‘Microsoft.Office.Interop.Excel Version=14.0.0.0 Culture=neutral PublicKeyToken=71e9bce111e9429c’ or one of its dependencies. The system cannot find the file specified. File name: ‘Microsoft.Office.Interop.Excel Version=14.0.0.0
The problem is the library expects Excel to be running on the server. If you are a developer and your DBA is okay with installing Excel the Server and bear all the license issues and the security holes it might create for you; I would say Microsoft.Office.Interop. Excel by far is the friendliest library I have a chance to work with Excel. What most likely to happen is you will get a red flag and work all over again with a different library.
If you decide to use OLEDB provider besides changing your data source you may have to change few more things in you actual code to make it to work with the way you want. Next I will discuss some of the changes that need to be made in the actual code for the script task to work properly in SSIS.
1. If the Excel files have long column headers or different formats or special characters:
OLEDB provider consider the top row as a header by default and it is very hard to make modification and the best solution can be force the provider not to consider the first row as header. This can be done by adding HDR=NO in your connection statement.
For example you can have connection string like this one.
string EXLConnString = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + strFile + ";Excel 8.0; HDR=NO; IMEX=1;Persist Security Info=False"; |
2. If you have column headers or null values for different rows. It is possible that you will see F(number) such as F2, F12,F22 as column headers and numbers given as values. To solve this do not simply assign you data reader as a data source to your data table. First populate your data set then use the data set as a data source for your data table.
//Connection String for Excel string EXLConnString = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + strFile + ";Excel 8.0; HDR=NO; IMEX=1;Persist Security Info=False"; OleDbConnection EXLConn = new OleDbConnection(EXLConnString); EXLConn.Open(); //SELECT everything SQL Command for Excel String sSQL = "SELECT * FROM [Sheet1$];"; //Create a data adapter and fill the data command to it OleDbCommand EXLComd = new OleDbCommand(sSQL, EXLConn); OleDbDataAdapter oDataAdapter = new OleDbDataAdapter(); oDataAdapter.SelectCommand = EXLComd; //Create a data set and populate using data adapter DataSet oDataSet = new DataSet(); oDataAdapter.Fill(oDataSet); //populate the data table using dataset DataTable oDataTable = new DataTable(); oDataTable = oDataSet.Tables[0]; |
3. If you have any logic that compares a variable with null while you were using from Interop library you have to have additional logic in you if that compares with empty String.
if (row != null && row != String.Empty) //or use if (row!= null && row!= "") |
These might be the most common problems you may have to fix but expect have more challenges with this library. In my opinion OLEDB provider is the safest choice for the server environment with all their hassles.
Even though this is a database and business intelligence blog and it the libraries are described in terms of how we could modify our code in SSIS script task the same can be said for web developers and windows application developers at times where their application has to work in the environment where there is no Excel application running.