Choice of data structure

**Hashim** · 02-11-2009

I'm going to redo much of my work because I have chosen the wrong data structure for the access files that are more expensive. All my work is based on files of the same results intermediaries. The purpose of my job is to find a solution to my problem but to underestimate the possible execution time (the measurement time by using the 'clock ()'.

But when I tested the time of execution on the comparison of two files according to defined criteria then I noticed that the time has become very long if we have files with hundreds of lines. We will read each line of file 1 and compared with all lines of file 2. It's very expensive.

Do you:
- With what data structures these two files will be replaced?

- In general, what is the data structure the least costly, memory and execution time so the lowest possible?

**!const** · 02-11-2009

Can you give us even more details? I mean with the information that you have given, it is harder to determine what database structure would be feasible and best suited for you and for your project.

**Hashim** · 02-11-2009

In my C program, the comparison function is:

Code:

*rets = compare_files("f.txt", "f2.txt" );

I passed two files "e.txt" and "f2.txt" to function 'compare_files'. The size of "e.txt" is always greater than "f2.txt". Each line contains two files for a string.

I am looking for lines that belong to "e.txt" and not "f2.txt". It is a kind of difference: lines of "e.txt" minus lines of "f2.txt"; provided a line of "e.txt" is identical to a line of "f2.txt". If both lines have the same value and the same number of words that form the two lines regardless of the order of words since the word order is not important in my problem. The most important thing: the same value and the same number

If that is the two lines have the same value and the same number of words then in this case the two lines are different.

For example:
"name surname age" = "name age surname"

Knowing that each line of the two files "e.txt" and "f2.txt" is composed of a single field (a string)

Be the first file "e.txt"
name surname
Name Age
NAME AGE
first_name Employment
name surname age
Job name
age employment
first_name age employment
Name Age Employment
name surname Employment
name surname age employment

Either the second file "f2.txt"

name
first_name
age
Employment
age name
name surname age
Job name
age employment
first_name age employment
name surname Employment

Applying the principle of comparison when this is achieved through:
name surname
Name Age Employment
name surname age employment

Then, we keep this result that the lines which does not contain another string with the result.

Here, we do not keep the string "name surname age employment" because it contains the string "name surname"

The desired end result is obtained:
name surname
Name Age Employment

Here is the function code 'compare_files'

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define OUTPUT_NAME "finalresult.txt"
#define MAX_SIZE 1024
void * xrealloc(void * prec, size_t len)
{
    void *ret = realloc(prec, len);
    if(!ret) 
        exit(0xDEAD);
    return ret;
}

char * mstrndup(const char *str, size_t n)
{
    char *ret = malloc((n+1) * sizeof(char));
    if (!ret)
        exit(0xDEAD);
    strncpy(ret, str, n); 
    ret[n] = 0; 
    return ret;
}

void free_tab(char **t, size_t len)
{
    if (t)
    {
        size_t i;
        for (i = 0; i < len; i++)
            free(t[i]);
        free(t);
    }
}

void free_tab2(char **t)
{
    size_t i = 0;
    while (t && t[i])
        free(t[i++]);
    free(t);
}

size_t count_word(const char *str)
{
    size_t n = 0;
    int a;
    while (*str) 
    {
        a = 0; 
        while (isalpha((unsigned char) *str) && *str) str++, a = 1; 
        if (a) n++; 
        while (!isalpha((unsigned char) *str) && *str) str++; 
    }
    return n;
}

void get_word(char **tab, const char *str)
{
    const char* p = str;
    int a, i = 0;
    while (*str)
    {
        a = 0;
        while (isalpha((unsigned char) *p) && *p) p++, a = 1;
        if (a)
            tab[i++] = mstrndup(str, p-str); 
        while (!isalpha((unsigned char) *p) && *p) p++;
        str = p;
    }
}

int compareline(char **t1, size_t size1, char **t2, size_t size2)
{
    int ret = 1;
    size_t i, j;
    int a;
    for (i = 0; i < size1; i++)
    {
        a = 0;
        for (j = 0; j < size2; j++)
            if (!strcmp(t1[i], t2[j]))
            {
                a = 1;
                break; 
            }
        if (!a)
        {
            ret = 0;
            break; 
        }
    }
    return ret;
}

int is_same(const char *s1, const char *s2, int comparesize)
{
    char **t1, **t2;
    size_t size1 = count_word(s1), size2 = count_word(s2);
    int ret = 0;
    if (!comparesize || (size1 > size2)) 
    {
        t1 = malloc(size1 * sizeof(char*));
        t2 = malloc(size2 * sizeof(char*));
        if (t1 && t2)
        {
            get_word(t1, s1);
            get_word(t2, s2);
            ret = comparesize? compareline(t2, size2, t1, size1) : compareline(t1, size1, t2, size2);
            free_tab(t1, size1), free_tab(t2, size2);
        }
        else
            exit(0xDEAD);
    }
    return ret;
}
char ** compare_files(const char *filename1, const char *filename2)
{
    FILE *f = fopen(filename1, "r" ), *f2 = fopen(filename2, "r" );
    char s[MAX_SIZE], s2[MAX_SIZE];
    int a, retsize = 0;
    char **ret = NULL;
    if (f && f2) 
    {
        while (fgets(s, MAX_SIZE, f)) 
        {
            a = 0;
            rewind(f2);
            while (fgets(s2, MAX_SIZE, f2))
                if (is_same(s, s2, 0))
                {
                    a = 1;
                    break;
                }
            if (!a)
            {
                ret = xrealloc(ret, (++retsize) * sizeof(char*)); 
                ret[retsize-1] = mstrndup(s, strlen(s));
            }
        }
        fclose(f), fclose(f2);
        ret = xrealloc(ret, (++retsize) * sizeof(char*)); 
        ret[retsize-1] = NULL;
    }
    else 
    {
        if (f) fclose(f);
        if (f2) fclose(f2);
    }
    return ret;
}
void write_tab(FILE *f, char **t, size_t len)
{
    size_t i;
    for (i = 0; t && i < len; i++)
        if (t[i])
            fprintf(f, t[i]);
}
void disp_tab(char **t)
{
    size_t i;
    for (i = 0; t && t[i]; i++)
        printf(t[i]);
}
int main(void)
{
#define MAXFILE 3
#define MAXTMP 10
#define N 4
    char **rets[MAXFILE] = {NULL}, tmp[MAXTMP];
    char **interret = NULL;
    size_t interret_size = 0;
    int i,j,k;
    int a, b;
    int equN = 0, equN1 = 0, equothr = 0, equi = 0;
    FILE *output;
    *rets = compare_files("f.txt", "f1.txt" );
    i = 0;
    while(*rets && (*rets)[i]) 
    {
        a = count_word((*rets)[i]); 
        if(a == N) 
          equN++, equi = i;
        else if (a == (N-1)) 
          equN1++;
        else if (a < (N-1))
          equothr = 1; 
        i++;
    }
    if(equN == 1 && !equothr) 
    {
        if(equN1) 
        {
            interret = *rets; 
            interret_size = i; 
            free((*rets)[equi]), (*rets)[equi] = NULL; 
        }
        else 
        {
          interret = *rets;
          interret_size = 1;
          for(i = 1; rets[0][i]; i++)
            free(rets[0][i]), rets[0][i] = NULL;
        }
    }
    else
    {
      for (i = 1; i < MAXFILE; i++) 
      {
          sprintf(tmp, "f%d.txt", i+1);
          rets[i] = compare_files("f.txt", tmp); 
      }
      i = 0;
      while (rets[0] && rets[0][i]) 
      {
          a = b = 0;
          for (j = 1; j < MAXFILE; j++) 
          {
              k = 0;
              while (rets[j] && rets[j][k]) 
              {
                  if (!strcmp(rets[0][i], rets[j][k])) 
                  {
                      a = 1; 
                      break;
                  }
                  k++;
              }
              if (!a) 
              {
                  b = 1;
                  break;
              }
          }
          if (!b) 
          {
              interret = xrealloc(interret, (++interret_size) * sizeof(char*));
              interret[interret_size-1] = mstrndup(rets[0][i], strlen(rets[0][i])); 
          }
          i++;
      }
      for (i = 0; i < MAXFILE; i++) 
          free_tab2(rets[i]);
      for (i = 0; i < interret_size; i++) 
      {
          for (j = 0; interret[i] && j < interret_size; j++) 
          {
              if (interret[j] && j != i) 
                  if (is_same(interret[i], interret[j], 1)) 
                  {
                      free(interret[i]), interret[i] = NULL; 
                      break;
                  }
          }
      }
    }
    if((output = fopen(OUTPUT_NAME, "w+" )) != NULL) 
    {
        write_tab(output, interret, interret_size);
        fclose(output);
    }
    free_tab(interret, interret_size); 
    return 0;
}

What would you think?

**!const** · 02-11-2009

My friend, simply write the main functions instead of showing the whole code which is useless to us. And by the way what is interret of your functions that wrap your libc?

**Big Fish** · 02-11-2009

Your files are how big? Is this you can keep them in memory rather than to reread n1 * n2 times? Is what you could not replace your strings with stuff that would represent a unique (checksums perhaps?) and that the comparison would be faster? (should ask someone other than myself to these aspects)

**Hashim** · 02-11-2009

@ big fish: When I measure the execution time of this treatment comparison using the 'clock ()' then I get a higher time even thousands of seconds for a file of hundreds of lines.

Code:

debut = clock();
//Comparison between two files
fin = clock();
fprintf(stderr, "temps : %f\n", (double)(fin-debut) / (double) CLOCKS_PER_SEC);

The file may contain hundreds of lines and each line contains a string (all words). I wanted to make them fit in memory but how to do this? and with what data structure we can do this?