Project 3: A CSV Parser
Reading time: 20 minutes
Difficulty Level: ★★★★★
Start early; Start Often!
Due date: April 23 23:59
Updates and Revisions
- [April 22 09:51] Updated walker pointer example. See Edstem post.
- [April 21 14:51] Added clarification of parse() input.
- [April 21 13:58] Added submission instructions.
- [April 19 15:56] Added sections on empty columns and testing.
Table of contents
Learning Goals
In this assignment, we will–
- learn about CSV files,
- start using pointers to access and modify data,
- practice more advanced string manipulation,
- learn how to parse command line arguments using
getopt()
, - use
fgets()
to read from files, - test our own code for correctness, and of course,
- practice using the terminal and vim some more.
Introduction
This time, we will be writing a program to read and parse CSv (comma-separated values) files.
CSV Files
This is a file format that uses comma-separated values to store data. You can simply think of a CSV file as a table with rows and columns. Each line in a CSV file represents a row, and the columns in each row are separated by a delimiter, which in the case of CSV files, is a comma (','
).
Examples
Consider the following table:
Name | Race | Age | Weapon |
---|---|---|---|
Frodo Baggins | Hobbit | 51 | Sting |
Aragorn | Human | 87 | Anduril |
Gandalf | Istari | Unknown | Staff |
Samwise Gamgee | Hobbit | 38 | Cast iron pan |
To store this table using comma-separated values, we get:
Name,Race,Age,Weapon
Frodo Baggins,Hobbit,51,Sting
Aragorn,Human,87,Anduril
Gandalf,Istari,Unknown,Staff
Samwise Gamgee,Hobbit,38,Cast iron pan
Empty Columns
It is totally acceptable to have empty columns in a CVS file: There would simply be no value between two commas for an empty column:
Example:
Name | Race | Age | Weapon |
---|---|---|---|
Gandalf | Istari | Staff | |
Samwise Gamgee | Hobbit | 38 |
Converting the above table to csv, we get
Name,Race,Age,Weapon
Gandlaf,Istari,,Staff
Samwise Gamgee,Hobbit,38,
You might ask, what if some data cell contains the delimiter character? Consider a CSV file using commas to separate values. What if we want to store the following address?
6755 Mira Mesa Blvd Ste 111, San Diego, CA 92121
If we store it directly, then it would be interpreted as three separated values.
To address this, values containing the delimiter character would have quotes around them (e.g.,
"Lorem, ipsum"
).However, in this PA, you can assume such conflicts will not happen.
The select
program
Your task for this programming assignment is to create a command line program called select
. This program reads CSV files from standard input (just like in PA 2) and selects desired columns.
The program is run like so:
Usage: ./select -c NCOLS [-n NRECS] COL1 COL2 COL3 ...
Let’s break it down–
./select
: The name of the executable program.-c NCOLS
: The number of columns to read from the input CSV.[-n NRECS]
: Optional. The maximum number of rows to parse.COL1 COL2 COL3 ...
: The indices of columns to output (1-indexed).
Usage Example
Consider the same CSV file from an earlier section. Let’s call this file lotr.csv
:
Name,Race,Age,Weapon
Frodo Baggins,Hobbit,51,Sting
Aragorn,Human,87,Anduril
Gandalf,Istari,Unknown,Staff
Samwise Gamgee,Hobbit,38,Cast iron pan
To select the Name
and Weapon
columns, we can run the following:
$ ./select -c 4 1 4 < lotr.txt
Name,Weapon
Frodo Baggins,Sting
Aragorn,Anduril
Gandalf,Staff
Samwise Gamgee,Cast iron pan
- The
-c 4
argument specifies there are 4 columns to read from the input CSV. - The subsequent
1 4
indicate the columns we wish to select and output. < lotr.txt
, as you will recall from last time, redirects the contents of thelotr.txt
file as input to theselect
program. (This is a feature of the shell, not something you need to implement.)
The output column indexes can be repeated:
$ ./select -c 4 1 2 1 3 4 4 < lotr.txt Name,Race,Name,Age,Weapon,Weapon Frodo Baggins,Hobbit,Frodo Baggins,51,Sting,Sting Aragorn,Human,Aragorn,87,Anduril,Anduril Gandalf,Istari,Gandalf,Unknown,Staff,Staff Samwise Gamgee,Hobbit,Samwise Gamgee,38,Cast iron pan,Cast iron pan
Processing Command Line Arguments
The first task for you is to implement command line arguments parsing.
The Anatomy of Command Line Arguments
Generally speaking, there are three kinds of command line arguments. Consider the following compiling command to compile a simple C file hello.c
into an executable program called hello
, with debugging features turned on (-g
):
$ gcc -o hello -g hello.c
In this command, we see 3 kinds of arguments:
- Flag Only: This would be like the
-g
flag when we compile a program with gcc. No value is specified. - Flag + Value: This would be like the
-o file
flag when we specify the executable name when compiling with gcc. - Positional arguments: This would be the final
hello.c
. It does not follow any flag like-g
or-o
.
Let’s revisit the command line arguments that our select
program supports:
Usage: ./select -c NCOLS -n NRECS COL1 COL2 COL3 ...
Here, we have:
-c NCOLS
is a flag + value combination option,-n NRECS
is also a flag + value combination option, andCOL1 COL2 COL3 ...
are all positional arguments.
The main
function: argc
and argv
You should be familiar with the Java way of handling command line arguments. As you no doubt recall, the Java main
method looks like this:
public static void main(String[] args) {
// ...
}
The String[] args
is an array of Strings, each representing one command line option passed in from the terminal. The number of arguments can be deduced by checking the length of the args
array (args.length
).
In C, however, arrays do not store their lengths. So there is no such thing as args.length
in C. This is why we have two separate parameters in the main
function:
int argc
: the number of command line arguments (lit., “argument count”).char *argv[]
: an array of command line arguments stored as strings (lit., “argument values). Recall that in C, strings are of typechar *
, soargv
is an array ofchar *
values.
Parsing Command Line Arguments Using getopt()
If you look at our starter code for PA 2, you will notice the use of the getopt()
function in the beginning of main()
to proces command line arguments.
Now, let’s learn about using getopt()
so you can write your own command line arguments handling code for this PA.
Suppose we wish to write a program named prog
that can be run like this:
./prog -a -b NUM ARG1 ARG2 ARG3 ...
Where -a
is a flag-only option, -b NUM
is a flag + value option, specifically, it accepts a number as an argument (so we will need to parse the string into a number), and everything after these (ARG1 ARG2 ARG3 ...
) are positional arguments.
We can write the following code to process these arguments:
int a_flag = 0; // 1 if -a flag is set, 0 if not set
int b_value = 0; // stores the value of -b flag
int c;
while ((c = getopt(argc, argv, "ab:")) != -1) {
switch (c) {
case 'a': // -a flag detected
a_flag = 1;
break;
case 'b': // -b flag detected
b_value = atoi(optarg); // convert argument value to int
break;
}
}
// After getopt processes all the flags, the positional arguments
// begin at index `optind`:
for (int i = optind; i < argc; i++) {
printf("argv[%d] = %s\n", i, argv[i]);
}
The getopt
function call
We call getopt
in a while loop and give it both argc
and argv
. To specify the flags to scan for, we also pass in a string "ab:"
.
The "ab:"
string tells the getopt
function that there are two flags to look out for: -a
and -b
. In addition, the colon (':'
) in "b:"
indicates that the -b
flag should be followed by a value.
In the body of the while loop, we use a switch-case statement to process specific flags.
The switch-case statement is another kind of conditional statement. In the code snipppet above, the switch-case statement is equivalent to
if (c == 'a') { a_flag = 1; } else if (c == 'b') { b_value = atoi(optarg); }
When writing a switch-case statement, do not forget the
break;
statement at the end of each case!
Using optarg
Note that when handling the -b
flag, we use the optarg
to obtain the value provided by the user from the command line.
This variable is defined by the library. So you do not need to declare it yourself. (There is no declaration for optarg
in our code snippet above.)
The optarg
variable is a string (char *
). If you want the value to be an integer, you will need to use the atoi()
function to convert it. You can see it in action in the example above.
Handling Remaining Positional Arguments Using optind
Once the while loop running getopt
finishes, all the flags will have been processed. All that remains to be done is processing the positional arguments, but where do they begin?
For this, the library defines another integer variable int optind
, which stores the index of the next argument to be processed. After getopt
finishes processing all the flag arguments, optind
would store the index of the next arguments, which is our first positional argument.
We then use a for loop to process all the remaining arguments.
String Splitting: The Basics
So how does parsing a CSV file actually work? As you will later see in the starter code, the CSV file is processed one line at a time, and one line is one string. So our task is boiled down to this: how do we break down a string into smaller parts? To understand this, we need to understand how a string is stored in memory.
Let’s take this hypothetical line/row from our CSV file:
Frodo,Hobbit,51\n
(Notice the newline character ('\n'
) at the end!)
When our program runs, this line will be read from the file into a string. And as you should know by now, a string in C is a null-terminated array of characters.
Let’s draw some ASCII art to represent the array of characters:
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ , │ H │ o │ b │ b │ i │ t │ , │ 5 │ 1 │ \n │ \0 │
You can see that the characters are laid out sequentially in memory, and the entire string is terminated by a null character ('\0'
). The variable line
is actually a pointer to the very beginning of the string, i.e., the first character.
Our goal is to end up with 3 separate null-terminated strings: "Frodo"
, "Hobbit"
, and "51"
.
To achieve this, we can simply replace the delimiter (','
and '\n'
) with null characters! If we do that, we get the following:
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ \0 │ H │ o │ b │ b │ i │ t │ \0 │ 5 │ 1 │ \0 │ \0 │
↑ ↑ ↑
str1 str2 str3
Now, we have three null-terminated strings: str1
which is "Frodo"
, str2
which is "Hobbit"
, and str3
which is "51"
.
As you will see in later sections, we want to break down the original string into an array of strings. So in this same example, the original string is:
char *line = "Frodo,Hobbit,51";
We want to end up with an array of strings called values
:
char *values[] = {"Frodo", "Hobbit", "51"};
It is critical that you understand the logic here for splitting strings completely. If not, you will not be able to do this assignment. If you need help understanding, please utilize office hours and tutor hours liberally!
Getting Started
Now that we understand our goal, let’s discuss the implementation.
Again, the starter code for this assignment is hosted on GitHub Classroom. Use the following link to accept the GitHub Classroom assignment:
Click here to accept this GitHub Classroom assignment. (Right click to open in new tab.)
Just like last time, clone the repository to your ieng6
server. (Do not clone the repo to your local machine.)
Introduction to Code Modularization and the Makefile
Until now, we have used the gcc
command directly to compile a single file. However, as you may know, more complicated programs benefit from modularization. This means the source code is spread across multiple files.
For this PA, the code is spread across two files:
main.c
: Takes care of argument parsing and error handling.parse.c
: The core CSV parsing logic.
With a modularized C program consisting of multiple C files, we no longer type out the gcc command ourselves when we build the executable. Instead, we rely on a build tool called make
, which will follow the build recipes described in the Makefile
.
To compile the program, simply type make
in the terminal. Since the starter code is incomplete, you may get compiler warnings/errors when you try to compile.
Do not attempt to compile the program by running gcc
on either of the .c
files! Only use make
to compile. (It’s easier anyway.)
Next, let’s look at the starter code, starting with main.c
:
Task 1 (main.c
): Writing getopt
Your first task is writing the code to handle command line arguments. As a reminder, our select
program is run as follows:
Usage: ./select -c NCOLS [-n NRECS] COL1 COL2 COL3 ...
We have written the skeleton of the getopt()
call and the while loop, just like we showed you earlier in this document. You task is to write the body of the while loop so that the program correctly parses the command line arguments, and stores the values in the corresponding variables ncols
and nrows
.
After the while loop, there are some error checking code to make sure the arguments are entered correctly. (E.g. the user never entered the number of columns via the -c
flag, or if they did not specify any output columns.)
Task 2 (main.c
): Setting outidx
The outidx
array is an array of integers. This array keeps track of the columns to be selected. (idx
is a commonly used shorthand for “index”.)
int outidx[MAX_COLS];
The important thing to note here is that the array is initialized to be quite large: MAX_COLS
is defined to be 100, which means we can specify up to 100 columns to output. (You can assume that this limit will not be exceeded.) However, you don’t have to use all 100 spaces in this array.
For instance, if the program is run like so:
$ ./select -c 10 -n 50 1 3 4 6 7 9 10
(Translation: the CSV input has 10 columns in each row, process the first 50 rows, and select the following columns: 1, 3, 4, 6, 7, 9, and 10.)
In this case, the outidx
array will look like {1, 3, 4, 6, 7, 9, 10, ...}
. The ...
here means that technically the array has more spaces, but they are irrelevant for the program.
In fact, you will see in the starter code a count
variable, which keeps track of how many columns are supposed to be selected. So the outidx
array is really only meaningful up to index count - 1
. In the previous example, count would have a value of 7, and outidx[0] = 1
, outidx[1] = 3
, outidx[2] = 4
, …, outidx[6] = 10
.
Task 3 (main.c
, parse.c
): Processing input
In the last part of the main()
function, you will find another while loop. This should look familiar to you, as we also had a similar while loop in the previous PA, which also called fgets()
to get output from the terminal.
The string you get from fgets()
includes a newline character ('\n'
)!
This while loop does two things:
- Call the
parse()
function to break down one line of input into thevalues
array, and - Print the selected columns using the
values
array andoutidx
array.
If the
parse()
function returns -1 while processing line N of the CSV input, you should print the following error message tostderr
and exit the program with a failure code:“Invalid syntax at line N”
where N is replaced by the actual line number.
The code looks something like this:
if (parse(/* arguments here */) == -1) { fprintf(stderr, INVALID_SYNTAX_MSG, lineno); return EXIT_FAILURE; }
The
INVALID_SYNTAX_MSG
is a macro:#define INVALID_SYNTAX_MSG "Invalid syntax at line %d.\n"
Note that here is also where you should handle the -n
command line argument, which specifies how many rows of input should be processed.
In the following section, we dive into the most important part of this assignment:
The parse()
Function
Go to parse.c
and you will see the parse()
function:
int parse(char *line, int ncols, char *values[])
The comment above the function documents what each parameter does.
To visualize what the parse()
function does to the line of CSV data that it gets, let’s return to the previous example of "Frodo,Hobbit,51\n"
.
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ , │ H │ o │ b │ b │ i │ t │ , │ 5 │ 1 │ \n │ \0 │
In the example above, the first parameter line
is "Frodo,Hobbit,51\n"
, and the second parameter ncols
is 3 because there are three columns in the CSV file.
We do not care what data is in the values
array at the beginning, because we are using it to store the results of our own computations.
Your parse()
function should be able to handle an input line
that ends with a newline character! If not, your code will fail some autograder test cases!
The high-level goal of this function is to take this line of CSV data, split it by the delimiter (i.e., the comma (','
)), and store the split strings (i.e. each column value "Frodo"
, "Hobbit"
, "51
”) into the values
array.
When we say “store a string”, what’s really happening is we are storing a pointer to the start of the string, i.e., a char *
(character pointer).
In the end, we should have values[0] = "Frodo"
, values[1] = "Hobbit"
, and values[2] = "51"
. However, it is more important to realize what that looks like in memory:
char *line = "Frodo,Hobbit,51";
↓
│ F │ r │ o │ d │ o │ \0 │ H │ o │ b │ b │ i │ t │ \0 │ 5 │ 1 │ \0 │ \0 │
↑ ↑ ↑
values[0] values[1] values[2]
As you can see, values[0]
stores a pointer to the start of "Frodo"
, values[1]
stores a pointer to the start of "Hobbit"
, and so on.
No new strings are allocated in this process. In other words, we have not allocated more memory for these three values. We are merely manipulating the original string to split it into three strings (by putting in null characters!)
Finding Delimiters Using Pointers!
To find all the delimiter (commas) in the line and replace them with null characters, we will simply traverse the string from start to finish, and process one character at a time.
You are only allowed to use a pointer to traverse the string! You will not receive any points for this PA if you used index-based traversal.
How do you traverse a string using pointers? Here’s a very simple example:
// Use a pointer to print out a string character by character
char *str = "This PA is hard!!";
char *c = str; // this is the pointer that will "walk through" the string
while (*c != '\0') { // as long as the null terminator hasn't been reached
printf("%c", *c); // print out each character
c++; // "walk" to the next character
}
printf("\n"); // print a newline after everything
Let’s again draw a diagram for this code:
char *str = "This PA is hard!\n";
↓
│ T │ h │ i │ s │ │ P │ A │ │ i │ s │ │ h │ a │ r │ d │ ! │ \0 │
↑
char *c = str;
As you can see, str
is a character pointer that points to the start of the string/array of characters. When we assign the walker pointer c
the value of str
, we are basically saying they both point to the same thing. So, as you can see in the diagram, both pointers str
and c
point to the same position.
When we want to access the character at the pointer, we simply dereference: *c
.
To traverse the string, we need to make the c
pointer walk through the string. So we need to advance it one character at a time: c++
. (Not the programming language!)
After you increment the c
pointer by doing c++
, it’ll look like this:
// after c++
char *str = "This PA is hard!\n";
↓
│ T │ h │ i │ s │ │ P │ A │ │ i │ s │ │ h │ a │ r │ d │ ! │ \0 │
↑
c (incremented)
You will implement your parse()
function based on this traversal! Of course, your code will be much more complicated than this because it needs to do much more.
What If 1: There are more columns than
ncols
?This is perfectly acceptable. You can simply stop processing after
ncols
number of columns have been processed and you have obtainedncols
substrings.What If 2: There are fewer columns than
ncols
?This would be an error. Because you are expecting at least
ncols
columns to be present in the string, but after traversing the entire string, you still have not found enough columns.If that is the case, the
parse()
function should return -1.
Pointers are hard to understand. They are also the most powerful feature of the C programming language, and arguably the essence of this class.
Do not hesitate to come to office hours for help.
And as you have probably realized while reading this writeup, diagrams are really helpful! You should draw a lot of diagrams if you have trouble working with pointers. Be patient. It takes practice.
Developing, Testing, and Debugging
Rule #1 is always run your code. You are not going to know anything about your code unless you run it! Once again we reiterate the importance of incremental development and testing.
If you don’t develop and test your program incrementally, bugs will pile up and bury you underneath!
To help you test your code, we have provided some example csv files in the starter repository. Just like in PA 2, you can feed these files into your program using redirection.
We have also provided you with a reference implementation. Please use this to check your own implementation!
More instructions will become available on testing.
Testing Your Own Code
In principle, we should always make sure the general use case for a program is working before moving on to testing edge cases/corner cases.
Testing the correctness of your code is about asking the right questions. Here are some questions you should ask yourself when you are testing your code:
- Do I understand how the program is run?
- Does my program correctly parse the command line arguments?
- Are the corresponding variables set correctly?
- Are all the selected columns stored correctly in the
outidx
array?
- Does the program read each line correctly?
- Is the program able to correctly identify delimiters?
- Is the program splitting up substrings (columns) correctly?
- Is the
parse()
function filling in thevalues
array correctly? - Can my program handle duplicate selected columns?
This is in no way an exhaustive list of things you should think about as you test your program. It is your job to test your program extensively before submitting to gradescope.
Do not use Gradescope as a way to verify your implementation. You should be reasonably confident in your own code (having tested it a lot) before even making the first submission!
Committing Using Git
As we have explained in discussion and in lab, git
is a great way to do version control. And it goes naturally with incremental development.
Every time you finish a small functionality, you should commit your code.
To commit your code, you need to be in your code repository directory, and run the following command:
$ git commit -am 'commit message here.'
The -am
flag is a combination of -a
(add all files) and -m
(commit message).
For the commit message, you should write something that is at least minimally meaningful. Here are some decent examples:
- Implemented key1 validation and error message.
- Fixed minor bug with key1 validation.
- Implemented loop to covert all input to lower case letters.
- Implemented basic version of encrypt.
Here are some not so great examples:
- .
- asdfasd
- why won’t you work
- aaaarrrgggghhh
- stupid bug
- ******* ****
- wrote some ****
All of these are commit messages that I have written throughout my short career, but you should aim to write yours more like the first list. At least when you are in a decent mood.
Pushing to GitHub
Every once in a while, maybe even after each commit, it would be a good idea to push your progress to GitHub in case the CSE building burns down and the ieng6 servers are destroyed.
To do that, navigate to your code repository directory, and run
$ git push
Simple as that.
Submission
Submitting from GitHub
First, make sure that the most up-to-date version of your code has been pushed to GitHub. If you want to double check, go to the GitHub web page and find your repository to make sure the code changes you made on ieng6 are actually all there.
Having done that, go to Gradescope and find the submission page for this PA. When you submit the assignment, you will see the option to submit from a specific GitHub repository. You will need to link your GitHub account first, after which you should be able to find your pa2 repository. Select it, and choose the main
branch.
We have not discussed branching in git yet, so unless you know what you are doing, you should only see main
for branch options.
Gradescope Autograder
You score will always be 0 before the deadline! Please see the results for specifc test cases to understand how your code performed.
After you submit your PA on gradescope, you’ll notice there are a set of public test cases and a (significantly larger) set of hidden test cases. As the names would imply, you will only have access to the public test cases, though you will be graded on the result of ALL the test cases.
Note that when we test your parse.c
implementation, we have our own simple test program (instead of using the more complicated select
program) called parse
.