Project 3: A CSV Parser

Reading time: 20 minutes

Difficulty Level: ★★★★★

Start early; Start Often!

Due date: April 23 23:59

Updates and Revisions

  • [April 22 09:51] Updated walker pointer example. See Edstem post.
  • [April 21 14:51] Added clarification of parse() input.
  • [April 21 13:58] Added submission instructions.
  • [April 19 15:56] Added sections on empty columns and testing.
Table of contents
  1. Learning Goals
  2. Introduction
    1. CSV Files
      1. Examples
      2. Empty Columns
  3. The select program
    1. Usage Example
  4. Processing Command Line Arguments
    1. The Anatomy of Command Line Arguments
    2. The main function: argc and argv
    3. Parsing Command Line Arguments Using getopt()
      1. The getopt function call
      2. Using optarg
      3. Handling Remaining Positional Arguments Using optind
  5. String Splitting: The Basics
  6. Getting Started
    1. Introduction to Code Modularization and the Makefile
  7. Task 1 (main.c): Writing getopt
  8. Task 2 (main.c): Setting outidx
  9. Task 3 (main.c, parse.c): Processing input
  10. The parse() Function
    1. Finding Delimiters Using Pointers!
  11. Developing, Testing, and Debugging
    1. Testing Your Own Code
    2. Committing Using Git
    3. Pushing to GitHub
  12. Submission
    1. Submitting from GitHub
    2. Gradescope Autograder

Learning Goals

In this assignment, we will–

  • learn about CSV files,
  • start using pointers to access and modify data,
  • practice more advanced string manipulation,
  • learn how to parse command line arguments using getopt(),
  • use fgets() to read from files,
  • test our own code for correctness, and of course,
  • practice using the terminal and vim some more.

Introduction

This time, we will be writing a program to read and parse CSv (comma-separated values) files.

CSV Files

This is a file format that uses comma-separated values to store data. You can simply think of a CSV file as a table with rows and columns. Each line in a CSV file represents a row, and the columns in each row are separated by a delimiter, which in the case of CSV files, is a comma (',').

Examples

Consider the following table:

Name Race Age Weapon
Frodo Baggins Hobbit 51 Sting
Aragorn Human 87 Anduril
Gandalf Istari Unknown Staff
Samwise Gamgee Hobbit 38 Cast iron pan

To store this table using comma-separated values, we get:

Name,Race,Age,Weapon
Frodo Baggins,Hobbit,51,Sting
Aragorn,Human,87,Anduril
Gandalf,Istari,Unknown,Staff
Samwise Gamgee,Hobbit,38,Cast iron pan

Empty Columns

It is totally acceptable to have empty columns in a CVS file: There would simply be no value between two commas for an empty column:

Example:

Name Race Age Weapon
Gandalf Istari   Staff
Samwise Gamgee Hobbit 38  

Converting the above table to csv, we get

Name,Race,Age,Weapon
Gandlaf,Istari,,Staff
Samwise Gamgee,Hobbit,38,

You might ask, what if some data cell contains the delimiter character? Consider a CSV file using commas to separate values. What if we want to store the following address?

6755 Mira Mesa Blvd Ste 111, San Diego, CA 92121

If we store it directly, then it would be interpreted as three separated values.

To address this, values containing the delimiter character would have quotes around them (e.g., "Lorem, ipsum").

However, in this PA, you can assume such conflicts will not happen.

The select program

Your task for this programming assignment is to create a command line program called select. This program reads CSV files from standard input (just like in PA 2) and selects desired columns.

The program is run like so:

Usage: ./select -c NCOLS [-n NRECS] COL1 COL2 COL3 ...

Let’s break it down–

  • ./select: The name of the executable program.
  • -c NCOLS: The number of columns to read from the input CSV.
  • [-n NRECS]: Optional. The maximum number of rows to parse.
  • COL1 COL2 COL3 ...: The indices of columns to output (1-indexed).

Usage Example

Consider the same CSV file from an earlier section. Let’s call this file lotr.csv:

Name,Race,Age,Weapon
Frodo Baggins,Hobbit,51,Sting
Aragorn,Human,87,Anduril
Gandalf,Istari,Unknown,Staff
Samwise Gamgee,Hobbit,38,Cast iron pan

To select the Name and Weapon columns, we can run the following:

$ ./select -c 4 1 4 < lotr.txt
Name,Weapon
Frodo Baggins,Sting
Aragorn,Anduril
Gandalf,Staff
Samwise Gamgee,Cast iron pan
  • The -c 4 argument specifies there are 4 columns to read from the input CSV.
  • The subsequent 1 4 indicate the columns we wish to select and output.
  • < lotr.txt, as you will recall from last time, redirects the contents of the lotr.txt file as input to the select program. (This is a feature of the shell, not something you need to implement.)

The output column indexes can be repeated:

$ ./select -c 4 1 2 1 3 4 4 < lotr.txt
Name,Race,Name,Age,Weapon,Weapon
Frodo Baggins,Hobbit,Frodo Baggins,51,Sting,Sting
Aragorn,Human,Aragorn,87,Anduril,Anduril
Gandalf,Istari,Gandalf,Unknown,Staff,Staff
Samwise Gamgee,Hobbit,Samwise Gamgee,38,Cast iron pan,Cast iron pan

Processing Command Line Arguments

The first task for you is to implement command line arguments parsing.

The Anatomy of Command Line Arguments

Generally speaking, there are three kinds of command line arguments. Consider the following compiling command to compile a simple C file hello.c into an executable program called hello, with debugging features turned on (-g):

$ gcc -o hello -g hello.c

In this command, we see 3 kinds of arguments:

  1. Flag Only: This would be like the -g flag when we compile a program with gcc. No value is specified.
  2. Flag + Value: This would be like the -o file flag when we specify the executable name when compiling with gcc.
  3. Positional arguments: This would be the final hello.c. It does not follow any flag like -g or -o.

Let’s revisit the command line arguments that our select program supports:

Usage: ./select -c NCOLS -n NRECS COL1 COL2 COL3 ...

Here, we have:

  1. -c NCOLS is a flag + value combination option,
  2. -n NRECS is also a flag + value combination option, and
  3. COL1 COL2 COL3 ... are all positional arguments.

The main function: argc and argv

You should be familiar with the Java way of handling command line arguments. As you no doubt recall, the Java main method looks like this:

public static void main(String[] args) {
    // ...
}

The String[] args is an array of Strings, each representing one command line option passed in from the terminal. The number of arguments can be deduced by checking the length of the args array (args.length).

In C, however, arrays do not store their lengths. So there is no such thing as args.length in C. This is why we have two separate parameters in the main function:

  1. int argc: the number of command line arguments (lit., “argument count”).
  2. char *argv[]: an array of command line arguments stored as strings (lit., “argument values). Recall that in C, strings are of type char *, so argv is an array of char * values.

Parsing Command Line Arguments Using getopt()

If you look at our starter code for PA 2, you will notice the use of the getopt() function in the beginning of main() to proces command line arguments.

Now, let’s learn about using getopt() so you can write your own command line arguments handling code for this PA.

Suppose we wish to write a program named prog that can be run like this:

./prog -a -b NUM ARG1 ARG2 ARG3 ...

Where -a is a flag-only option, -b NUM is a flag + value option, specifically, it accepts a number as an argument (so we will need to parse the string into a number), and everything after these (ARG1 ARG2 ARG3 ...) are positional arguments.

We can write the following code to process these arguments:

int a_flag = 0; // 1 if -a flag is set, 0 if not set
int b_value = 0;    // stores the value of -b flag
int c;
while ((c = getopt(argc, argv, "ab:")) != -1) {
    switch (c) {
        case 'a': // -a flag detected
            a_flag = 1;
            break;
        case 'b': // -b flag detected
            b_value = atoi(optarg); // convert argument value to int
            break;
    }
}

// After getopt processes all the flags, the positional arguments
// begin at index `optind`:
for (int i = optind; i < argc; i++) {
    printf("argv[%d] = %s\n", i, argv[i]);
}

The getopt function call

We call getopt in a while loop and give it both argc and argv. To specify the flags to scan for, we also pass in a string "ab:".

The "ab:" string tells the getopt function that there are two flags to look out for: -a and -b. In addition, the colon (':') in "b:" indicates that the -b flag should be followed by a value.

In the body of the while loop, we use a switch-case statement to process specific flags.

The switch-case statement is another kind of conditional statement. In the code snipppet above, the switch-case statement is equivalent to

if (c == 'a') {
    a_flag = 1;
} else if (c == 'b') {
    b_value = atoi(optarg);
}

When writing a switch-case statement, do not forget the break; statement at the end of each case!

Using optarg

Note that when handling the -b flag, we use the optarg to obtain the value provided by the user from the command line.

This variable is defined by the library. So you do not need to declare it yourself. (There is no declaration for optarg in our code snippet above.)

The optarg variable is a string (char *). If you want the value to be an integer, you will need to use the atoi() function to convert it. You can see it in action in the example above.

Handling Remaining Positional Arguments Using optind

Once the while loop running getopt finishes, all the flags will have been processed. All that remains to be done is processing the positional arguments, but where do they begin?

For this, the library defines another integer variable int optind, which stores the index of the next argument to be processed. After getopt finishes processing all the flag arguments, optind would store the index of the next arguments, which is our first positional argument.

We then use a for loop to process all the remaining arguments.

String Splitting: The Basics

So how does parsing a CSV file actually work? As you will later see in the starter code, the CSV file is processed one line at a time, and one line is one string. So our task is boiled down to this: how do we break down a string into smaller parts? To understand this, we need to understand how a string is stored in memory.

Let’s take this hypothetical line/row from our CSV file:

Frodo,Hobbit,51\n

(Notice the newline character ('\n') at the end!)

When our program runs, this line will be read from the file into a string. And as you should know by now, a string in C is a null-terminated array of characters.

Let’s draw some ASCII art to represent the array of characters:

char *line = "Frodo,Hobbit,51";
       ↓ 
       │ F │ r │ o │ d │ o │ , │ H │ o │ b │ b │ i │ t │ , │ 5 │ 1 │ \n │ \0 │

You can see that the characters are laid out sequentially in memory, and the entire string is terminated by a null character ('\0'). The variable line is actually a pointer to the very beginning of the string, i.e., the first character.

Our goal is to end up with 3 separate null-terminated strings: "Frodo", "Hobbit", and "51".

To achieve this, we can simply replace the delimiter (',' and '\n') with null characters! If we do that, we get the following:

char *line = "Frodo,Hobbit,51";
       ↓ 
       │ F │ r │ o │ d │ o │ \0 │ H │ o │ b │ b │ i │ t │ \0 │ 5 │ 1 │ \0 │ \0 │
       ↑                        ↑                            ↑
     str1                      str2                         str3

Now, we have three null-terminated strings: str1 which is "Frodo", str2 which is "Hobbit", and str3 which is "51".

As you will see in later sections, we want to break down the original string into an array of strings. So in this same example, the original string is:

char *line = "Frodo,Hobbit,51";

We want to end up with an array of strings called values:

char *values[] = {"Frodo", "Hobbit", "51"};

It is critical that you understand the logic here for splitting strings completely. If not, you will not be able to do this assignment. If you need help understanding, please utilize office hours and tutor hours liberally!

Getting Started

Now that we understand our goal, let’s discuss the implementation.

Again, the starter code for this assignment is hosted on GitHub Classroom. Use the following link to accept the GitHub Classroom assignment:

Click here to accept this GitHub Classroom assignment. (Right click to open in new tab.)

Just like last time, clone the repository to your ieng6 server. (Do not clone the repo to your local machine.)

Introduction to Code Modularization and the Makefile

Until now, we have used the gcc command directly to compile a single file. However, as you may know, more complicated programs benefit from modularization. This means the source code is spread across multiple files.

For this PA, the code is spread across two files:

  • main.c: Takes care of argument parsing and error handling.
  • parse.c: The core CSV parsing logic.

With a modularized C program consisting of multiple C files, we no longer type out the gcc command ourselves when we build the executable. Instead, we rely on a build tool called make, which will follow the build recipes described in the Makefile.

To compile the program, simply type make in the terminal. Since the starter code is incomplete, you may get compiler warnings/errors when you try to compile.

Do not attempt to compile the program by running gcc on either of the .c files! Only use make to compile. (It’s easier anyway.)

Next, let’s look at the starter code, starting with main.c:

Task 1 (main.c): Writing getopt

Your first task is writing the code to handle command line arguments. As a reminder, our select program is run as follows:

Usage: ./select -c NCOLS [-n NRECS] COL1 COL2 COL3 ...

We have written the skeleton of the getopt() call and the while loop, just like we showed you earlier in this document. You task is to write the body of the while loop so that the program correctly parses the command line arguments, and stores the values in the corresponding variables ncols and nrows.

After the while loop, there are some error checking code to make sure the arguments are entered correctly. (E.g. the user never entered the number of columns via the -c flag, or if they did not specify any output columns.)

Task 2 (main.c): Setting outidx

The outidx array is an array of integers. This array keeps track of the columns to be selected. (idx is a commonly used shorthand for “index”.)

int outidx[MAX_COLS];

The important thing to note here is that the array is initialized to be quite large: MAX_COLS is defined to be 100, which means we can specify up to 100 columns to output. (You can assume that this limit will not be exceeded.) However, you don’t have to use all 100 spaces in this array.

For instance, if the program is run like so:

$ ./select -c 10 -n 50 1 3 4 6 7 9 10 

(Translation: the CSV input has 10 columns in each row, process the first 50 rows, and select the following columns: 1, 3, 4, 6, 7, 9, and 10.)

In this case, the outidx array will look like {1, 3, 4, 6, 7, 9, 10, ...}. The ... here means that technically the array has more spaces, but they are irrelevant for the program.

In fact, you will see in the starter code a count variable, which keeps track of how many columns are supposed to be selected. So the outidx array is really only meaningful up to index count - 1. In the previous example, count would have a value of 7, and outidx[0] = 1, outidx[1] = 3, outidx[2] = 4, …, outidx[6] = 10.

Task 3 (main.c, parse.c): Processing input

In the last part of the main() function, you will find another while loop. This should look familiar to you, as we also had a similar while loop in the previous PA, which also called fgets() to get output from the terminal.

The string you get from fgets() includes a newline character ('\n')!

This while loop does two things:

  1. Call the parse() function to break down one line of input into the values array, and
  2. Print the selected columns using the values array and outidx array.

If the parse() function returns -1 while processing line N of the CSV input, you should print the following error message to stderr and exit the program with a failure code:

“Invalid syntax at line N”

where N is replaced by the actual line number.

The code looks something like this:

if (parse(/* arguments here */) == -1) {
    fprintf(stderr, INVALID_SYNTAX_MSG, lineno);
    return EXIT_FAILURE;
}

The INVALID_SYNTAX_MSG is a macro:

#define INVALID_SYNTAX_MSG "Invalid syntax at line %d.\n"

Note that here is also where you should handle the -n command line argument, which specifies how many rows of input should be processed.

In the following section, we dive into the most important part of this assignment:

The parse() Function

Go to parse.c and you will see the parse() function:

int parse(char *line, int ncols, char *values[])

The comment above the function documents what each parameter does.

To visualize what the parse() function does to the line of CSV data that it gets, let’s return to the previous example of "Frodo,Hobbit,51\n".

char *line = "Frodo,Hobbit,51";
       ↓ 
       │ F │ r │ o │ d │ o │ , │ H │ o │ b │ b │ i │ t │ , │ 5 │ 1 │ \n │ \0 │

In the example above, the first parameter line is "Frodo,Hobbit,51\n", and the second parameter ncols is 3 because there are three columns in the CSV file.

We do not care what data is in the values array at the beginning, because we are using it to store the results of our own computations.

Your parse() function should be able to handle an input line that ends with a newline character! If not, your code will fail some autograder test cases!

The high-level goal of this function is to take this line of CSV data, split it by the delimiter (i.e., the comma (',')), and store the split strings (i.e. each column value "Frodo", "Hobbit", "51”) into the values array.

When we say “store a string”, what’s really happening is we are storing a pointer to the start of the string, i.e., a char * (character pointer).

In the end, we should have values[0] = "Frodo", values[1] = "Hobbit", and values[2] = "51". However, it is more important to realize what that looks like in memory:

char *line = "Frodo,Hobbit,51";
       ↓ 
       │ F │ r │ o │ d │ o │ \0 │ H │ o │ b │ b │ i │ t │ \0 │ 5 │ 1 │ \0 │ \0 │
       ↑                        ↑                            ↑
    values[0]                values[1]                     values[2]

As you can see, values[0] stores a pointer to the start of "Frodo", values[1] stores a pointer to the start of "Hobbit", and so on.

No new strings are allocated in this process. In other words, we have not allocated more memory for these three values. We are merely manipulating the original string to split it into three strings (by putting in null characters!)

Finding Delimiters Using Pointers!

To find all the delimiter (commas) in the line and replace them with null characters, we will simply traverse the string from start to finish, and process one character at a time.

You are only allowed to use a pointer to traverse the string! You will not receive any points for this PA if you used index-based traversal.

How do you traverse a string using pointers? Here’s a very simple example:

// Use a pointer to print out a string character by character
char *str = "This PA is hard!!";
char *c = str;  // this is the pointer that will "walk through" the string
while (*c != '\0') { // as long as the null terminator hasn't been reached
    printf("%c", *c);  // print out each character
    c++; // "walk" to the next character
}
printf("\n");  // print a newline after everything

Let’s again draw a diagram for this code:

char *str = "This PA is hard!\n";
       ↓ 
       │ T │ h │ i │ s │   │ P │ A │   │ i │ s │   │ h │ a │ r │ d │ ! │ \0 │
       ↑
char *c = str;

As you can see, str is a character pointer that points to the start of the string/array of characters. When we assign the walker pointer c the value of str, we are basically saying they both point to the same thing. So, as you can see in the diagram, both pointers str and c point to the same position.

When we want to access the character at the pointer, we simply dereference: *c.

To traverse the string, we need to make the c pointer walk through the string. So we need to advance it one character at a time: c++. (Not the programming language!)

After you increment the c pointer by doing c++, it’ll look like this:

// after c++
char *str = "This PA is hard!\n";
       ↓ 
       │ T │ h │ i │ s │   │ P │ A │   │ i │ s │   │ h │ a │ r │ d │ ! │ \0 │
           ↑
           c (incremented)

You will implement your parse() function based on this traversal! Of course, your code will be much more complicated than this because it needs to do much more.

What If 1: There are more columns than ncols?

This is perfectly acceptable. You can simply stop processing after ncols number of columns have been processed and you have obtained ncols substrings.

What If 2: There are fewer columns than ncols?

This would be an error. Because you are expecting at least ncols columns to be present in the string, but after traversing the entire string, you still have not found enough columns.

If that is the case, the parse() function should return -1.

Pointers are hard to understand. They are also the most powerful feature of the C programming language, and arguably the essence of this class.

Do not hesitate to come to office hours for help.

And as you have probably realized while reading this writeup, diagrams are really helpful! You should draw a lot of diagrams if you have trouble working with pointers. Be patient. It takes practice.

Developing, Testing, and Debugging

Rule #1 is always run your code. You are not going to know anything about your code unless you run it! Once again we reiterate the importance of incremental development and testing.

If you don’t develop and test your program incrementally, bugs will pile up and bury you underneath!

To help you test your code, we have provided some example csv files in the starter repository. Just like in PA 2, you can feed these files into your program using redirection.

We have also provided you with a reference implementation. Please use this to check your own implementation!

More instructions will become available on testing.

Testing Your Own Code

In principle, we should always make sure the general use case for a program is working before moving on to testing edge cases/corner cases.

Testing the correctness of your code is about asking the right questions. Here are some questions you should ask yourself when you are testing your code:

  1. Do I understand how the program is run?
  2. Does my program correctly parse the command line arguments?
    1. Are the corresponding variables set correctly?
    2. Are all the selected columns stored correctly in the outidx array?
  3. Does the program read each line correctly?
  4. Is the program able to correctly identify delimiters?
  5. Is the program splitting up substrings (columns) correctly?
  6. Is the parse() function filling in the values array correctly?
  7. Can my program handle duplicate selected columns?

This is in no way an exhaustive list of things you should think about as you test your program. It is your job to test your program extensively before submitting to gradescope.

Do not use Gradescope as a way to verify your implementation. You should be reasonably confident in your own code (having tested it a lot) before even making the first submission!

Committing Using Git

As we have explained in discussion and in lab, git is a great way to do version control. And it goes naturally with incremental development.

Every time you finish a small functionality, you should commit your code.

To commit your code, you need to be in your code repository directory, and run the following command:

$ git commit -am 'commit message here.'

The -am flag is a combination of -a (add all files) and -m (commit message).

For the commit message, you should write something that is at least minimally meaningful. Here are some decent examples:

  1. Implemented key1 validation and error message.
  2. Fixed minor bug with key1 validation.
  3. Implemented loop to covert all input to lower case letters.
  4. Implemented basic version of encrypt.

Here are some not so great examples:

  1. .
  2. asdfasd
  3. why won’t you work
  4. aaaarrrgggghhh
  5. stupid bug
  6. ******* ****
  7. wrote some ****

All of these are commit messages that I have written throughout my short career, but you should aim to write yours more like the first list. At least when you are in a decent mood.

Pushing to GitHub

Every once in a while, maybe even after each commit, it would be a good idea to push your progress to GitHub in case the CSE building burns down and the ieng6 servers are destroyed.

To do that, navigate to your code repository directory, and run

$ git push

Simple as that.

Submission

Submitting from GitHub

First, make sure that the most up-to-date version of your code has been pushed to GitHub. If you want to double check, go to the GitHub web page and find your repository to make sure the code changes you made on ieng6 are actually all there.

Having done that, go to Gradescope and find the submission page for this PA. When you submit the assignment, you will see the option to submit from a specific GitHub repository. You will need to link your GitHub account first, after which you should be able to find your pa2 repository. Select it, and choose the main branch.

We have not discussed branching in git yet, so unless you know what you are doing, you should only see main for branch options.

Gradescope Autograder

You score will always be 0 before the deadline! Please see the results for specifc test cases to understand how your code performed.

After you submit your PA on gradescope, you’ll notice there are a set of public test cases and a (significantly larger) set of hidden test cases. As the names would imply, you will only have access to the public test cases, though you will be graded on the result of ALL the test cases.

Note that when we test your parse.c implementation, we have our own simple test program (instead of using the more complicated select program) called parse.