Perl Script to help feed DBMS MySQL

**KALANI84** · 11-11-2009

I started in Perl and I am reading the doc to learn things. My problem is simple, I have multiple files that I just need to manipulate, format to feed my MySQL database. I use MySQL 5.1, Toad Perl v4.1 & v5.10

This is one example among many other CSV file:

Code:

245362;77;77289;INR69907;Felt;799;20091102232406;;VEL;799;66;24;
243640;77;66017;INT81553;"Notebook";699;20091102232318;;VEL;699;66;24;
236958;77;08;DRT03286;bed;299;20091102231945;;VEL;299;66;15;
319771;77;79060;INA38898;Pen;399;20091102231902;;VEL;399;66;3;
319771;77;79060;BTQ38898;Pen;399;20091102231902;;VEL;399;66;24;
319250;77;13110;INJ40768;Mattresses;599;20091102231736;;VEL;599;66;24;
320343;77;21758;PRP31628;Mice;299;20091102231459;;VEL;299;66;24;
331379;77;27209;INT59377;Kit;299;20091102231319;;VEL;299;66;15;
314147;77;89301;BTQ12922;Scissors;0;20091102231248;;VEL;0;66;24;
334467;77;26234;TEL38873;Handkerchief;499;20091102231131;;VEL;499;66;24;
241749;77;37732;TEL90191;Eraser;699;20091102231104;;VEL;699;66;24;

The desired result:

Code:

insert into vod_billing values("236958","8","DRT03286","bed","299","20091102231945","15","DRT" );
insert into vod_billing values("241749","37732","TEL90191","Eraser","699","20091102231104","24","TEL" );
insert into vod_billing values("243640","66017","INT81553","Notebook","699","20091102232318","24","INT" );
insert into vod_billing values("245362","77289","INR69907","Felt","799","20091102232406","24","INT" );
insert into vod_billing values("314147","89301","BTQ12922","Scissors","0","20091102231248","24","BTQ" );
insert into vod_billing values("319250","13110","INJ40768","Mattresses","599","20091102231736","24","INT" );
insert into vod_billing values("319771","79060","BTQ38898","Pen","399","20091102231902","24","BTQ" );
insert into vod_billing values("319771","79060","INA38898","Pen","399","20091102231902","3","INT" );
insert into vod_billing values("320343","21758","PRP31628","Mice","299","20091102231459","24","PRP" );
insert into vod_billing values("331379","27209","INT59377","Kit","299","20091102231319","15","INT" );
insert into vod_billing values("334467","26234","TEL38873","Handkerchief","499","20091102231131","24","TEL" );

it gives this:
1.concatenate all *.csv* a directory into a single file
2.replace all characters "by anything (double quote)
3.Remove all duplicate lines
4.Sortby the first column
5. Remove all empty lines
6. The character is the delimiter column then delete column 2,8,9,10 & 11
7. To add last column in the first 3 characters of the column 3 (TEL, PRI etc ...)
8. On this last column, replace all INA INJ & Rs per INT
9. Insert before the line vod_billing insert into values ( "
10. To include end of line ");
11. To replace the delimiter, for ""
12. Rename the output file .sql

The ultimate goal would be to directly feed my MySQL database created with lines, but already such a script would be perfect.

I know it's a bit abused but probably simpler for a developer!

**Zecho** · 11-11-2009

There's a difference between helping someone who starts, and her spinning is all scripts. For referrals, I suggest you to go and read a little help on the functions open(), split(), close(). A quick foreach () will also help you.

**CitricAcid** · 11-11-2009

It's all very easy in perl, but what is (an order of magnitude) the size (number of lines) of a file .csv obtained by concatenation of those listed in a directory, in order to see the strategy most appropriate (while memory or not) to remove duplicates and sorting.

**$tatic** · 11-11-2009

Originally Posted by KALANI84

it gives this:
1.concatenate all *.csv* a directory into a single file
2.replace all characters "by anything (double quote)
3.Remove all duplicate lines
4.Sortby the first column
5. Remove all empty lines
6. The character is the delimiter column then delete column 2,8,9,10 & 11

The correct order to do this is:
1. Concatenate all *. csv * a directory into a single file
2. Remove all empty lines
3. The character is the delimiter column then delete column 2,8,9,10 & 11
4. Sort by the first column
5. Delete all duplicate lines
6. Replace all characters "by anything (double quote)

Because otherwise we may have lines that are not duplicates before deleting columns which become then (I do not know the exact structure of data lines, so it implies) [or when the order initially it is important and can be duplicates after deleting columns?]

Delete duplicates after sorting is optimal: the duplicates will then successive, which makes things easier.

**KALANI84** · 11-11-2009

As stated I started doing the script Via Agenda quote (I added in step 7 & 8)

Code:

#!/usr/bin/perl -w
use strict;
use warnings;
# Global Variables
my $LOCAL_DIR = "/cygdrive/d/Nicolas/VOD/_.billing/_.compiler/backup/test/";
my $DEST_DIR = "/cygdrive/d/Nicolas/VOD/_.billing/_.compiler/backup/";
my $SEARCH_FILE = "*_billing_tvod_*";
my $OUTPUT_FILE = $DEST_DIR . sprintf("RESULT_FILE.TXT" );
# Script Perl
print "Starting...\n";
print "\n";
open (OUT,">$OUTPUT_FILE" ) or die "Message : $!\n";
my @RESULT_FILE = <$LOCAL_DIR/$SEARCH_FILE>;
foreach my $LINE(@RESULT_FILE)
{
  open IN,"$LINE" or warn "Message : $!\n";
  while(<IN> )
  {
      s/\"//g;
      next if m/^$/;
      my @SPLIT_TAB = split(/;+/);
      push(@SPLIT_TAB,substr($SPLIT_TAB[3],0,3));
      chomp(@SPLIT_TAB);
      if($SPLIT_TAB[11] eq 'INA')
      {
          my $ERASE_COLUMN = pop(@SPLIT_TAB);
          push(@SPLIT_TAB,'INR');
          chomp(@SPLIT_TAB);
      }
      elsif($SPLIT_TAB[11] eq 'INS')
      {
          my $ERASE_COLUMN = pop(@SPLIT_TAB);
          push(@SPLIT_TAB,'INR');
          chomp(@SPLIT_TAB);
      }
      elsif($SPLIT_TAB[11] eq 'INJ')
      {
          my $ERASE_COLUMN = pop(@SPLIT_TAB);
          push(@SPLIT_TAB,'INR');
          chomp(@SPLIT_TAB);
      }
      print OUT "$SPLIT_TAB[0] $SPLIT_TAB[2] $SPLIT_TAB[3] $SPLIT_TAB[4] $SPLIT_TAB[5] $SPLIT_TAB[6] $SPLIT_TAB[10] $SPLIT_TAB[11]\n";
      #print OUT $_;
  }
}
print "\n";
print "Ending...\n";
END

I managed tasks 1, 2, 6, 7 & 8 and probably 3.

For task 3 deletion is not done because I'm selecting some columns, but I think it is not optimized.

For task 4 I can not include it in the foreach so it'll have that I treat this in a second file from the 45/46 line is that it?

Stupid question, how I leave my RESULT_FILE.TXT file I created? So to put it in a new table for the lot?

For task 5 I found a function with an array of hash on the result but one of my columns can be seen truncated. I do not know if it's the function or fate previous:
sub doublons_grep {
my %hash_sans_doublon;

return grep {! $hash_sans_doublon{$_}++ }@_;
}

To answer questions of CitricAcid:
The order of magnitude is about 30000 lines.
There is no restriction to the elimination of duplication before or after sorting, no matter.