In this video you will understand the origins and applications of the R programming language. Want to take the interactive coding exercises and earn a certificate? Join DataCamp today, and start the free introduction to R tutorial: https://www.datacamp.com/courses/free-introduction-to-r === Hi! My name is Filip and I'm a data scientist at DataCamp. DataCamp is an online data science school. You'll take fun video lessons, like the one you're watching now, and solve interactive coding challenges, where you receive instant and detailed feedback. All this happens in the comfort of your browser, so you can immediately start learning the skill of the future. In this introduction to R course you will learn about the basics of R, as well as the most common data structures it uses to store data. By the end of this course, you will know how to create these data structures, manipulate them and perform calculations on them to get surprising insights. But first things first: the basics of R. R is also called the language for statistical computing, and is one of the most popular languages for data science, used by tons of companies and universities around the globe in all sorts of fields. Optimizing a financial portfolio? Mapping marketing data? Analyzing outcomes of clinical trials? You name it, R can handle it. But why did R become so popular? Well, first of all, it's free to use! Next, R's visualization capabilities are top notch, making it easy to build beautiful plots. It's also easy to create so-called packages, which are extensions to R. R's very active community has created thousands of these packages for many different fields. Last but not least, R is an actual programming language, with a command-line interface for executing code. This is a big plus compared to other point-and-click programs out there. It might take some energy to fully get the hang of it, but fear not: DataCamp is here to help you master R in no time! Let's get started.
An important component of R is the console. It's a place where you can execute R commands. In DataCamp's interactive interface, the console can be found here. Let's try to calculate the sum of 1 and 2. We simply type 1 + 2 at the prompt in the console and hit Enter. R interprets what you typed and prints the result. R is more than a scientific calculator, though. You can also create so-called variables. A variable allows you to store data in R for later use. You can use the less than sign followed by a dash to create a variable. Suppose the height of a rectangle is 2. Let's assign this value 2 to a variable height. In the console, we type height, less than sign, dash, 2: This time, R does not print anything, because it assumes that you will be using this variable in the future. If you now simply type and execute height in the console, R returns 2: We can do a similar thing for the width of our imaginary rectangle. We assign the value 4 to a variable width. Typing width gives us 4, great. As you're assigning variables in the R console, you're actually building up the R workspace. It's the place where R variables 'live'. You can list all variables with the ls() function. Simply type ls followed by empty parentheses and hit Enter. This shows you a list of all the variables you have created up to now. There are two objects in your workspace at the moment, height and width. If we try to access a variable that's not in the workspace, depth for example, R throws an error. Suppose you now want to find out the area of our imaginary rectangle, which is height multiplied by width. height equals 2, and width equals 4, so the result is 8. Let's also assign this result to a new variable, area. Inspecting the workspace again with ls() shows that the workspace contains three objects now: area, height and width. Now, this is all great, but what if you want to recalculate the area of your imaginary rectangle when the height is 3 and the width is 6?
You'd have to reassign the variables width and height in the console, and then recalculate the area. That's quite some coding you'd have to redo, isn't it? This is where R scripts come in! An R script is simply a text file with successive lines of R code. Let's create such a script, "rectangle.R", that contains the code we've written up to now. Next, you can run this script. In the DataCamp interface, you can do this with the 'Submit Answer' button. R goes through your code, line by line, executing every command one by one in the console, just as if you were typing each command yourself. The cool thing is that if you want to change your code, you can simply adapt your script and run it again. Let's change the height to 3 and the width to 6, and rerun the script. The variables are given different values this time, and the output changes accordingly.
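The script described above can be sketched as follows; the variable names come straight from the video, and the final bare `area` and `ls()` lines make R print their values when the script runs.

```r
# rectangle.R -- the script built up in this video
height <- 2             # assign 2 to height with the <- operator
width <- 4              # assign 4 to width
area <- height * width  # area of the rectangle

area   # prints 8
ls()   # the workspace now holds "area", "height" and "width"
```

Changing the first two lines to `height <- 3` and `width <- 6` and rerunning the script recomputes `area` as 18, which is exactly the rerun shown in the video.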
Discover the power of the data frame in R! Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r By now, you have already learned quite a few things in R. Data structures such as vectors, matrices and lists have no secrets for you anymore. However, R is a statistical programming language, and in statistics you'll often be working with data sets. Such data sets typically consist of observations, or instances. All these observations have some variables associated with them. You can have, for example, a data set of 5 people. Each person is an instance, and the properties of these people, such as their name, their age and whether they have children, are the variables. How could you store such information in R? In a matrix? Not really, because the name would be a character and the age would be a numeric; these don't fit together in a matrix. In a list maybe? This could work, because you can put practically anything in a list. You could create a list of lists, where each sublist is a person, with a name, an age and so on. However, the structure of such a list is not really useful to work with. What if you want to know all the ages, for example? You'd have to write a lot of R code just to get what you want. But what data structure could we use then? Meet the data frame. It's the fundamental data structure for storing typical data sets. It's pretty similar to a matrix, because it also has rows and columns. Also for data frames, the rows correspond to the observations, the persons in our example, while the columns correspond to the variables, or the properties of each of these persons. The big difference with matrices is that a data frame can contain elements of different types. One column can contain characters, another one numerics and yet another one logicals. That's exactly what we need to store our persons' information in the dataset, right?
We could have a column for the name, which is character, one for the age, which is numeric, and one logical column to denote whether the person has children. There still is a restriction on the data types, though. Elements in the same column should be of the same type. That's not really a problem, because in one column, the age column for example, you'll always want a numeric, because an age is always a number, regardless of the observation. So, for the practical part now: creating a data frame. In most cases, you don't create a data frame yourself. Instead, you typically import data from another source. This could be a CSV file or a relational database, but data can also come from other software packages like Excel or SPSS. Of course, R provides ways to manually create data frames as well. You use the data dot frame function for this. To create our people data frame that has 5 observations and 3 variables, we'll have to pass the data frame function 3 vectors that are all of length five. The vectors you pass correspond to the columns. Let's create these three vectors first: `name`, `age` and `child`. Now, calling the data frame function is simple: The printout of the data frame already shows very clearly that we're dealing with a data set. Notice how the data frame function inferred the names of the columns from the variable names you passed it. To specify the names explicitly, you can use the same techniques as for vectors and lists. You can use the names function, ... , or use equals signs inside the data frame function to name the data frame columns right away. Like in matrices, it's also possible to name the rows of the data frame, but that's generally not a good idea so I won't go into detail on that here. Before you head over to some exercises, let me briefly discuss the structure of a data frame some more. If you look at this structure, ..., there are two things you can see here: First, the printout looks suspiciously similar to that of a list.
That's because, under the hood, the data frame actually is a list. In this case, it's a list with three elements, corresponding to each of the columns in the data frame. Each list element is a vector of length 5, corresponding to the number of observations. A requirement that is not present for lists is that the vectors you put in the data frame all have to be the same length. If you try to create a data frame with 3 vectors that are not all of the same length, you'll get an error. Second, the name column, which you expect to be a character vector, is actually a factor. That's because R by default stores the strings as factors. To suppress this behaviour, you can set the stringsAsFactors argument of the data.frame function to FALSE. Now, the name column actually contains characters. With this new knowledge, you're ready for some first exercises on this extremely useful and powerful data structure.
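A minimal sketch of the people data frame described above; the five names, ages and child flags are invented for illustration.

```r
# Three vectors of length 5, one per column; the values are made up
name  <- c("Anna", "Ben", "Cleo", "Dan", "Eva")
age   <- c(28, 35, 41, 23, 52)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Combine them into a data frame; stringsAsFactors = FALSE keeps
# the name column as plain characters instead of a factor
people <- data.frame(name, age, child, stringsAsFactors = FALSE)

str(people)   # a list-like structure: 5 obs. of 3 variables
people$age    # pulling out all the ages is now a one-liner
```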
Learn more about credit risk modeling with R: https://www.datacamp.com/courses/introduction-to-credit-risk-modeling-in-r Hi, and welcome to the first video of the credit risk modeling course. My name is Lore, I'm a data scientist at DataCamp and I will help you master some basics of the credit risk modeling field. The area of credit risk modeling is all about the event of loan default. Now what is loan default? When a bank grants a loan to a borrower, which could be an individual or a company, the bank will usually transfer the entire amount of the loan to the borrower. The borrower will then reimburse this amount in smaller chunks, including some interest payments, over time. Usually these payments happen monthly, quarterly or yearly. Of course, there is a certain risk that a borrower will not be able to fully reimburse this loan. This results in a loss for the bank. The expected loss a bank will incur is composed of three elements. The first element is the probability of default, which is the probability that the borrower will fail to make a full repayment of the loan. The second element is the exposure at default, or EAD, which is the expected value of the loan at the time of default. You can also look at this as the amount of the loan that still needs to be repaid at the time of default. The third element is loss given default, which is the amount of the loss if there is a default, expressed as a percentage of the EAD. Multiplying these three elements leads to the formula of expected loss. In this course, we will focus on the probability of default. Banks keep information on the default behavior of past customers, which can be used to predict default for new customers. Broadly, this information can be classified in two types. The first type of information is application information. Examples of application information are income, marital status, et cetera. 
The second type of information, behavioral information, tracks the past behavior of customers, for example the current account balance and payment arrears history. Let's have a look at the first ten lines of our data set. This data set contains information on past loans. Each line represents one customer and his or her information, along with a loan status indicator, which equals 1 if the customer defaulted, and 0 if the customer did not default. Loan status will be used as a response variable and the explanatory variables are the amount of the loan, the interest rate, grade, employment length, home ownership status, the annual income and the age. The grade is the bureau score of the customer, where A indicates the highest class of creditworthiness and G the lowest. This bureau score reflects the credit history of the individual and is the only behavioral variable in the data set. For an overview of the data structure for categorical variables, you can use the CrossTable() function in the gmodels package. Applying this function to the home ownership variable, you get a table with each of the categories in this variable, with the number of cases and proportions. Using loan status as a second argument, you can look at the relationship between this factor variable and the response. By setting prop.r equal to TRUE and the other proportions listed here equal to FALSE, you get the row-wise proportions. Now what does this result tell you? It seems that the default rate in the home ownership group OTHER is quite a bit higher than the default rate in, for example, the home ownership group MORTGAGE, with 17.5 versus 9.8 percent of defaults in these groups, respectively. Now, let's explore other aspects of the data using R.
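The row-wise proportions described above can be reproduced in base R with table() and prop.table(); the tiny vectors below are an invented stand-in for the course's loan data, and the gmodels call is only sketched in a comment.

```r
# With gmodels loaded, the call from the video would look roughly like:
# CrossTable(loan_data$home_ownership, loan_data$loan_status,
#            prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)

# Base-R equivalent on made-up data
home_ownership <- c("RENT", "MORTGAGE", "RENT", "OTHER", "MORTGAGE", "OTHER")
loan_status    <- c(0, 0, 1, 1, 0, 0)

tab <- table(home_ownership, loan_status)
prop.table(tab, margin = 1)   # row-wise proportions: each row sums to 1
```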
First video of our latest course by Daniel Chen: Cleaning Data in Python. Like and comment if you enjoyed the video! A vital component of data science involves acquiring raw data and getting it into a form ready for analysis. In fact, it is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose your data for problems to dealing with missing values and outliers. At the end of the course, you'll apply all of the techniques you've learned to a case study in which you'll clean a real-world Gapminder dataset! So you've just got a brand new dataset and are itching to start exploring it. But where do you begin, and how can you be sure your dataset is clean? This chapter will introduce you to the world of data cleaning in Python! You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows. Try the first chapter for free: https://www.datacamp.com/courses/cleaning-data-in-python
Learn more about text mining: https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words Hi, I'm Ted. I'm the instructor for this intro text mining course. Let's kick things off by defining text mining and quickly covering two text mining approaches. Academic text mining definitions are long, but I prefer a more practical approach. So text mining is simply the process of distilling actionable insights from text. Here we have a satellite image of San Diego overlaid with social media pictures and traffic information for the roads. It is simply too much information to help you navigate around town. This is like a bunch of text that you couldn't possibly read and organize quickly, like a million tweets or the entire works of Shakespeare. You're drinking from a firehose! So in this example, if you need directions to get around San Diego, you need to reduce the information in the map. Text mining works in the same way. You can text mine a bunch of tweets or all of Shakespeare to reduce the information, just like this map. Reducing the information helps you navigate and draw out the important features. This is a text mining workflow. After defining your problem statement you transition from an unorganized state to an organized state, finally reaching an insight. In chapter 4, you'll use this in a case study comparing Google and Amazon. The text mining workflow can be broken up into 6 distinct components. Each step is important and helps to ensure you have a smooth transition from an unorganized state to an organized state. This helps you stay organized and increases your chances of a meaningful output. The first step involves problem definition. This lays the foundation for your text mining project. Next is defining the text you will use as your data. As with any analytical project, it is important to understand the medium and data integrity, because these can affect outcomes. Next you organize the text, maybe by author or chronologically.
Step 4 is feature extraction. This can be calculating sentiment or in our case extracting word tokens into various matrices. Step 5 is to perform some analysis. This course will help show you some basic analytical methods that can be applied to text. Lastly, step 6 is the one in which you hopefully answer your problem questions, reach an insight or conclusion, or in the case of predictive modeling produce an output. Now let’s learn about two approaches to text mining. The first is semantic parsing based on word syntax. In semantic parsing you care about word type and order. This method creates a lot of features to study. For example a single word can be tagged as part of a sentence, then a noun and also a proper noun or named entity. So that single word has three features associated with it. This effect makes semantic parsing "feature rich". To do the tagging, semantic parsing follows a tree structure to continually break up the text. In contrast, the bag of words method doesn’t care about word type or order. Here, words are just attributes of the document. In this example we parse the sentence "Steph Curry missed a tough shot". In the semantic example you see how words are broken down from the sentence, to noun and verb phrases and ultimately into unique attributes. Bag of words treats each term as just a single token in the sentence no matter the type or order. For this introductory course, we’ll focus on bag of words, but will cover more advanced methods in later courses! Let’s get a quick taste of text mining!
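A bare-bones bag-of-words sketch in base R, using the example sentence from the video. (The course itself uses dedicated text mining packages; this base-R version only illustrates the idea that word type and order are discarded.)

```r
sentence <- "Steph Curry missed a tough shot"

# Split on spaces and lowercase: each word becomes an anonymous token
tokens <- unlist(strsplit(tolower(sentence), " ", fixed = TRUE))

# The "bag": term counts, with no word types and no word order kept
table(tokens)
```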
Learn more about machine learning with R: https://www.datacamp.com/courses/machine-learning-toolbox In the last video, we manually split our data into a single test set, and evaluated out-of-sample error once. However, this process is a little fragile: the presence or absence of a single outlier can vastly change our out-of-sample RMSE. A better approach than a simple train/test split is using multiple test sets and averaging out-of-sample error, which gives us a more precise estimate of true out-of-sample error. One of the most common approaches for multiple test sets is known as "cross-validation", in which we split our data into ten "folds" or train/test splits. We create these folds in such a way that each point in our dataset occurs in exactly one test set. This gives us 10 test sets and, better yet, means that every single point in our dataset is predicted out-of-sample exactly once. In other words, we get a set of predictions that is the same size as our training set, but is composed entirely of out-of-sample predictions! We assign each row to its test set randomly, to avoid any kind of systematic biases in our data. This is one of the best ways to estimate out-of-sample error for predictive models. One important note: after doing cross-validation, you throw all the resampled models away and start over! Cross-validation is only used to estimate the out-of-sample error for your model. Once you know this, you re-fit your model on the full training dataset, so as to fully exploit the information in that dataset. This, by definition, makes cross-validation very expensive: it inherently takes 11 times as long as fitting a single model (10 cross-validation models plus the final model). The train function in caret does a different kind of re-sampling known as bootstrap validation by default, but is also capable of doing cross-validation, and in practice the two methods yield similar results. Let's fit a cross-validated model to the mtcars dataset.
First, we set the random seed, since cross-validation randomly assigns rows to each fold and we want to be able to reproduce our model exactly. The train function has a formula interface, which is identical to the formula interface for the lm function in base R. However, it supports fitting hundreds of different models, which are easily specified with the "method" argument. In this case, we fit a linear regression model, but we could just as easily specify method = 'rf' and fit a random forest model, without changing any of our code. This is the second most useful feature of the caret package, behind cross-validation of models: it provides a common interface to hundreds of different predictive models. The trControl argument controls the parameters caret uses for cross-validation. In this course, we will mostly use 10-fold cross-validation, but this flexible function supports many other cross-validation schemes. Additionally, we provide the verboseIter = TRUE argument, which gives us a progress log as the model is being fit and lets us know if we have time to get coffee while the models run. Let's practice cross-validating some models.
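The fit described above might look like the sketch below; `caret`, `mtcars`, `method = "lm"` and the 10-fold `trainControl()` settings come from the video, while the seed value and the formula `mpg ~ hp + wt` are assumptions made here for illustration.

```r
library(caret)   # provides train() and trainControl()

# Reproducible fold assignment
set.seed(42)

model <- train(
  mpg ~ hp + wt, data = mtcars,   # formula interface, as with lm()
  method = "lm",                  # swap in "rf" for a random forest
  trControl = trainControl(
    method = "cv", number = 10,   # 10-fold cross-validation
    verboseIter = TRUE            # progress log while fitting
  )
)

model   # reports cross-validated RMSE and R-squared
```

Note how changing only the `method` argument swaps the underlying model without touching the rest of the code, which is the common-interface feature the video highlights.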
In this video you will understand what the basic data types are in R. Want to take the interactive coding exercises and earn a certificate? Join DataCamp today, and start the free introduction to R tutorial: https://www.datacamp.com/courses/free-introduction-to-r In the previous video you saw that R is also known as the Language for Statistical Computing. Data is at the center of any statistical analysis, so let me introduce you to some of R's fundamental data types, also called atomic vector types. Throughout our experiments, we will use the function class(). This is a useful way to see what type a variable is. Let's head over to the console and start with TRUE, in capital letters. TRUE is a logical. That's also what class(TRUE) tells us. Logicals are so-called boolean values, and can be either `TRUE` or `FALSE`. Well, actually, `NA`, to denote missing values, is also a logical, but I won't go into detail on that here. `TRUE` and `FALSE` can be abbreviated to `T` and `F` respectively, as you can see here. However, I want to strongly encourage you to use the full versions, `TRUE` and `FALSE`. Next, let's experiment with numbers. The values 2 and 2.5 are called numerics in R. You can perform all sorts of operations on them such as addition, subtraction, multiplication, division and many more. A special type of numeric is the integer. It is a way to represent natural numbers like 1 and 2. To specify that a number is an integer, you can add a capital L to it. You don't see the difference between the integer 2 and the numeric 2 from the output. However, the `class()` function reveals the difference. Instead of asking for the class of a variable, you can also use the is-dot-functions to see whether variables are actually of a certain type. To see if a variable is a numeric, we can use the is-dot-numeric function. It appears that both are numerics. To see if a variable is an integer, we can use is-dot-integer.
This shows us that integers are numerics, but that not all numerics are integers, so there's some kind of type hierarchy going on here. Last but not least, there's the character string. The class of this type of object is "character". It's important to note that there are other data types in R, such as double for higher precision numerics, complex for handling complex numbers, and raw to store raw bytes. However, you will have tons of fun working with numerics, integers, logicals and characters in the remainder of this introductory course, so we'll leave these alone for now. There are cases in which you want to change the type of a variable to another one. How would that work? This is where coercion comes into play! By using the as-dot functions, you can coerce a variable from one type to another. Many kinds of transformation between types are possible. Have a look at these examples. The first command here coerces the logical TRUE to a numeric. FALSE, however, coerces to the numeric zero. We can also coerce numerics to characters. But what about the other way around? Can you also coerce characters to numerics? Sure you can! You can even convert the character string "4.5" to an integer, but this implies some information loss, because you cannot keep the decimal part here. But beware: coercion, as in converting data types, is not always possible. Let's try to convert the character "Hello" to a numeric. This conversion outputs an NA, a missing value. R doesn't understand how to transform "Hello" into a numeric, and decides to return a Not Available instead. You now have the essentials of what R is, how to use its basic features and which data types you will encounter most in your R quest. Now head over to the exercises and I'll see you in the next chapter!
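The type checks and coercions walked through above, collected into one runnable console session:

```r
class(TRUE)           # "logical"
class(2L)             # "integer" -- the L marks an integer literal
class(2.5)            # "numeric"
is.numeric(2L)        # TRUE: integers are numerics...
is.integer(2.5)       # FALSE: ...but not all numerics are integers

as.numeric(TRUE)      # 1
as.numeric(FALSE)     # 0
as.character(4.5)     # "4.5"
as.integer("4.5")     # 4 -- the decimal part is lost
as.numeric("Hello")   # NA, with a warning: this coercion is not possible
```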
In this introduction to R course you will learn how you can create and name your vectors in R. Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r Hi again! In this video I'll be talking about vectors. A vector is nothing more than a sequence of data elements of the _same_ basic data type. Remember the atomic vector types I discussed before? You can have character vectors, numeric vectors, logical vectors, and many more. First things first: creating a vector in R! You use the `c()` function for this, which allows you to combine values into a vector. Suppose you're playing a basic card game, and record the suit of 5 cards you draw from a deck. A possible outcome and corresponding vector to contain this information could be this one. Of course, we could also assign this character vector to a new variable, drawn_suits for example. We now have a character vector, drawn_suits. We can assert that it is a vector by typing is dot vector drawn_suits. Likewise, you could create a vector of integers, for example, to store how many cards of each suit remain after you drew the 5 cards. Let's call this vector remain. There are 11 more spades, 12 more hearts, 11 diamonds, and all 13 clubs still remain. If you print remain to the console, ..., it looks ok, but it's not very informative. How does somebody else know that the first value corresponds to spades? Wouldn't it be useful if you could attach labels to the vector elements? You can do this in R by naming the vector. You can use the `names()` function for this. Let's first create another character vector, `suits`, that contains the strings "spades", "hearts", "diamonds", and "clubs", the names you want to give your vector elements. Now, this line of code, ..., sets the names of the elements in `remain` to the strings in `suits`. If you now print remain to the console, ...
, you'll see that the suits information is accompanied by the proper labels. Great! If you don't want to bother with setting the names afterwards, you could just as well create a named vector with a one-liner. You can use equals signs inside the `c()` function: Notice that here, it's not necessary to surround the names, "spades", "hearts", "diamonds" and "clubs", with double quotes, although this also works perfectly fine. In all three cases, the result is exactly the same. Under the hood, R vectors have attributes associated with them. What you did when you set the names of the remain vector is actually setting the names attribute of the remain object. The `str()` function, which compactly displays the structure of an R object, shows this. You'll have plenty of fun creating and naming variables in all sorts of ways, but before I let you get to it, there are two more things I want to discuss with you. First of all, remember the variables you've created in the previous chapter? These variables, such as `my_apples`, equal to 5, and `my_oranges`, equal to the character string "six" at some point, are actually all vectors themselves. R does not provide a data structure to hold a single number or a single character string or any other basic data type: they're all just vectors of length 1. You can check this by typing is dot vector my_apples, which is TRUE, and is dot vector my_oranges, which is TRUE as well. The fact that these variables are actually vectors of length 1 can be checked using the length() function. This contrasts with the other vectors we've created in this video: the drawn_suits vector, for example, has length 5. The last important thing is that in R, a vector can only hold elements of the same type. They're also often called *atomic vectors*, to differentiate them from *lists*, another data structure which can hold elements of different types. This means that you cannot have a vector that contains both logicals and numerics, for example.
If you do try to build such a vector, R automatically performs coercion to make sure that you end up with a vector that contains elements of the same type. Let's see how that works with an example. In contrast to recording the suits you draw from a deck of cards, suppose now you're recording the _ranks_ of the cards. You might want to combine the result of drawing 8 cards like this, creating a vector drawn_ranks. If you now inspect this vector, you'll see that the numeric vector elements have been coerced to characters, to end up with a homogeneous character vector. This is also what the class function from before tells us. The fact that R handles this for us automatically, 'upgrading' logicals to numerics and numerics to characters when necessary along the way, is useful but can also be dangerous, so be aware of this. If you want to store elements of different types in the same data structure, you'll want to use a list. But that's something for later. Now, it's time to step up your betting game in the interactive exercises!
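The card-game vectors from this video, sketched in code; the particular suits and ranks drawn are invented examples.

```r
# Creating a vector with c(), and confirming it is one
drawn_suits <- c("hearts", "spades", "hearts", "diamonds", "spades")
is.vector(drawn_suits)   # TRUE

# Naming a vector with names()...
remain <- c(11, 12, 11, 13)
suits <- c("spades", "hearts", "diamonds", "clubs")
names(remain) <- suits

# ...or with the equivalent one-liner using equals signs inside c()
remain2 <- c(spades = 11, hearts = 12, diamonds = 11, clubs = 13)

# Mixing numerics and characters coerces everything to character
drawn_ranks <- c(7, "A", 2, "K", 3, 4, "Q", 10)
class(drawn_ranks)       # "character"
```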
Learn more about for loops in R: https://www.datacamp.com/courses/writing-functions-in-r You've seen for loops before, but this is a good time to review them, and then in the following exercises we'll teach you a few new things about for loops you might not have seen before. For loops are used for iteration. Let's take a look at an example from the Intermediate R course. primes_list is a list of the first six primes. Our for() loop says: for each value of i from 1 to the length of primes_list, print the i'th element of primes_list. When we run that in R, i starts at the value 1 and the first element of primes_list is printed, i increases to the next value 2 and the second element of primes_list is printed, and so on until the final iteration where i is 6, the length of primes_list, and the sixth element is printed. Let's examine that loop a little closer. There are three parts common to all for loops. First, the sequence. The sequence describes two things: the name we'll give to the object that indexes the iteration, in our case i, and the values that this index should iterate over, in our case the integers from 1 to the length of primes_list. The second part of the for loop is the body. This comes between our curly braces, and describes the operations to iterate, referring back to our index i. In this case, print() the ith element of primes_list. Finally, the third common part of a for loop is the definition of where to store the result. Our loop here doesn't actually have this part; it prints to the screen rather than saving the output. Let's take a look at another example that you will build upon in the following exercises. Here is a data frame, df, with four columns; our goal is to loop over the columns, each time calculating the median. What should the sequence be? Well, we want to do something for each column, starting at the first and ending at the last. Let's say i from 1 to ncol(df), the number of columns in df. How about the body?
It's very similar to our previous loop: print() the median of the ith column of df. You might not have seen a column pulled from a data frame using double square brackets before. A data frame is built on top of a list, where each element is a column. Since double bracket subsetting pulls out elements inside a list, here it pulls out columns from our data frame. Once again, we haven't kept any output, since we are just printing these column medians to the screen. It would be much more useful if at the end of the loop we had an output vector containing these medians. Over the next few exercises, you'll learn two things. First, a safer way to generate the sequence, and second, how instead of printing the results we could save them into an output object.
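A sketch of the column-median loop described above, already using the safer sequence (seq_along()) and a pre-allocated output vector that the exercises build towards; df and its values are invented here.

```r
# A made-up four-column data frame
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6),
                 c = c(7, 8, 9), d = c(10, 11, 12))

# Pre-allocate the output, then fill it in as we iterate
medians <- numeric(ncol(df))
for (i in seq_along(df)) {        # safer than 1:ncol(df) if df has no columns
  medians[i] <- median(df[[i]])   # [[i]] pulls out the ith column as a vector
}
medians                           # 2 5 8 11
```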
Understand how to create and name your matrices in R. Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r So, what is a matrix? Well, a matrix is kind of like the big brother of the vector. Where a vector is a _sequence_ of data elements, which is one-dimensional, a matrix is a similar collection of data elements, but this time arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional. As with the vector, the matrix can contain only one atomic vector type. This means that you can't have logicals and numerics in a matrix, for example. There's really not much more theory about matrices than this: a matrix is a natural extension of the vector, going from one to two dimensions. Of course, this has its implications for manipulating and subsetting matrices, but let's start with simply creating and naming them. To build a matrix, you use the matrix() function. Most importantly, it needs a vector containing the values you want to place in the matrix, and at least one matrix dimension: you can choose to specify the number of rows or the number of columns. Have a look at the following example, which creates a 2-by-3 matrix containing the values 1 to 6 by specifying the vector and setting the nrow argument to 2: R sees that the input vector has length 6 and that there have to be two rows. It then infers that you'll probably want 3 columns, such that the number of matrix elements matches the number of input vector elements. You could just as well specify ncol instead of nrow; in this case, R infers the number of _rows_ automatically. In both these examples, R takes the vector containing the values 1 to 6 and fills up the matrix column by column.
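A minimal sketch of the two calls just described, with the values from the text:

```r
# 2-by-3 matrix: R infers 3 columns from the vector length and nrow
matrix(1:6, nrow = 2)

# Specifying ncol instead: R infers the number of rows
matrix(1:6, ncol = 3)
```

Both calls produce the same matrix, filled column by column: the first column holds 1 and 2, the second 3 and 4, and so on.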
If you prefer to fill up the matrix in a row-wise fashion, such that the 1, 2 and 3 are in the first row, you can set the `byrow` argument of matrix() to `TRUE`. Can you spot the difference? Remember how R did recycling when you were subsetting vectors using logical vectors? The same thing happens when you pass the matrix function a vector that is too short to fill up the entire matrix. Suppose you pass a vector containing the values 1 to 3 to the matrix function, and explicitly say you want a matrix with 2 rows and 3 columns: R fills up the matrix column by column and simply repeats the vector. If you try to fill up the matrix with a vector whose length does not divide the number of matrix elements evenly, for example when you want to put a 4-element vector in a 6-element matrix, R generates a warning message. Actually, apart from the `matrix()` function, there's yet another easy way to create matrices that is more intuitive in some cases. You can paste vectors together using the `cbind()` and `rbind()` functions. Have a look at these calls: `cbind()`, short for column bind, takes the vectors you pass it and sticks them together as if they were columns of a matrix. The `rbind()` function, short for row bind, does the same thing but takes the input as rows and makes a matrix out of them. These functions can come in pretty handy, because they're often easier to use than the `matrix()` function. The `bind` functions I just introduced can actually also handle matrices, so you can easily use them to paste another row or another column to an already existing matrix. Suppose you have a matrix `m`, containing the elements 1 to 6: If you want to add another row to it, containing the values 7, 8, 9, you could simply run this command: You can do a similar thing with `cbind()`: Next up is naming the matrix. In the case of vectors, you simply used the names() function, but in the case of matrices, you can assign names to both columns and rows.
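Before moving on to naming, here is a sketch of the row-wise filling, the recycling, and the bind calls just described:

```r
# Row-wise filling: 1, 2 and 3 end up in the first row
matrix(1:6, nrow = 2, byrow = TRUE)

# Recycling: a 3-element vector is repeated to fill the 2-by-3 matrix
matrix(1:3, nrow = 2, ncol = 3)

# Pasting vectors together as columns or as rows
cbind(1:3, 4:6)
rbind(1:3, 4:6)

# The bind functions also accept matrices, so you can grow one
m <- matrix(1:6, byrow = TRUE, nrow = 2)
rbind(m, 7:9)   # appends the row 7 8 9
```

A 4-element vector in the same 2-by-3 call would also be recycled, but with a warning, since 6 is not a multiple of 4.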
That's why R came up with the rownames() and colnames() functions. Their use is pretty straightforward. Taking the matrix `m` from before again, we can set the row names just the same way as we named vectors, but this time with the rownames() function. Printing m shows that it worked: Setting the column names with a vector of length 3 gives us a fully named matrix. Just as with vectors, there are also one-liner ways of naming a matrix while you're building it. You use the dimnames argument of the matrix function for this. Check this out.
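A sketch of both naming approaches; the row and column names here are made-up placeholders:

```r
m <- matrix(1:6, byrow = TRUE, nrow = 2)

# Name rows and columns after the fact
rownames(m) <- c("row1", "row2")
colnames(m) <- c("col1", "col2", "col3")
m

# Or name while building, with the dimnames argument:
# a list with the row names first, then the column names
m2 <- matrix(1:6, byrow = TRUE, nrow = 2,
             dimnames = list(c("row1", "row2"),
                             c("col1", "col2", "col3")))
```

Both routes produce exactly the same named matrix.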
In this introduction to R course you will learn about the basics of R, as well as the most common data structures it uses to store data Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r If you have some background in statistics, you'll have heard about categorical variables. Unlike numerical variables, categorical variables can only take on a limited number of different values. Otherwise put, a categorical variable can only belong to a limited number of categories. As R is a statistical programming language, it's no surprise that there exists a specific data structure for this: factors. If you store categorical data as factors, you can rest assured that all the statistical modelling techniques will handle such data correctly. A good example of a categorical variable is a person's blood type: it can be A, B, AB or O. Suppose we have asked 8 people what their blood type is and recorded the information as a vector `blood`. Now, for R it is not yet clear that you're dealing with categorical variables, or factors, here. To convert this vector to a factor, you can use the `factor()` function. The printout looks somewhat different from the original one: there are no double quotes anymore, and the factor levels, corresponding to the different categories, are also printed. R basically does two things when you call the factor function on a character vector: first of all, it scans through the vector to see the different categories that are in there. In this case, that's "A", "AB", "B" and "O". Notice here that R sorts the levels alphabetically. Next, it converts the character vector, blood in this example, to a vector of integer values. These integers correspond to a set of character values to use when the factor is displayed. Inspecting the structure reveals this: We're dealing with a factor with 4 levels.
The "A"'s are encoded as 1, because "A" is the first level, "AB" is encoded as 2, "B" as 3 and "O" as 4. Why this conversion? Well, it can be that your categories are very long character strings. Repeating such a string for every observation can take up a lot of memory. By using this simple encoding, much less space is necessary. Just remember that factors are actually integer vectors, where each integer corresponds to a category, or a level. As I said before, R automatically infers the factor levels from the vector you pass it and orders them alphabetically. If you want a different order in the levels, you can specify the levels argument in the factor function. If you compare the structures of `blood_factor` and `blood_factor2`, you'll see that the encoding is different now. Next to changing the order of the levels, it is also possible to manually specify the level names, instead of letting R choose them. Suppose that for clarity, you want to display the blood types as `BT_A`, `BT_AB`, `BT_B` and `BT_O`. To name the factor afterwards, you can use the `levels()` function. Similar to the names() function for naming vectors, you can pass a vector to levels(blood_factor). You can also specify the category names, or levels, by specifying the `labels` argument in `factor()`. I admit it, it's a bit confusing. For both of these approaches, it's important to follow the same order as the order of the factor levels: first A, then AB, then B and then O. But this can be pretty dangerous: you might have mistakenly changed the order. To solve this, you can use a combination of manually specifying the `levels` and the `labels` arguments when creating a factor. With `levels`, you specify the order, just like before, while with `labels`, you specify a new name for the categories: In the world of categorical variables, there's also a difference between nominal categorical variables and ordinal categorical variables. A nominal categorical variable has no implied order.
For example, you can't really say that the blood type "O" is greater or less than the blood type "A". "O" is not worth more than "A" in any sense I can think of. Trying such a comparison with factors will generate a warning, telling you that less than is not meaningful: However, there are examples for which such a natural ordering does exist. Consider for example this tshirt vector. It has codes ranging from small to large. Here, you could say that an extra large indeed is greater than, say, a small, right? Of course, R provides a way to impose this kind of order on a factor, thus making it an ordered factor. Inside the factor() function, you simply set the argument ordered to TRUE, and specify the levels in ascending order. Can you see how these less-than signs appear between the different factor levels? This compactly shows that we're dealing with an ordered factor now. If we now try to perform a comparison, this call for example, ..., evaluates to TRUE, without a warning message, because a medium was specified to be less than a large.
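A sketch of the factor workflow described above; the 8 recorded blood types and the tshirt vector are made up for illustration:

```r
blood <- c("B", "AB", "O", "A", "B", "B", "A", "O")

# Convert to a factor; levels are sorted alphabetically by default
blood_factor <- factor(blood)
levels(blood_factor)   # "A" "AB" "B" "O"

# Control the level order and give friendlier labels in one call
blood_factor2 <- factor(blood,
                        levels = c("O", "A", "B", "AB"),
                        labels = c("BT_O", "BT_A", "BT_B", "BT_AB"))

# An ordered factor for tshirt sizes, levels in ascending order
tshirt <- c("M", "L", "S", "S", "L", "M", "L", "M")
tshirt_factor <- factor(tshirt, ordered = TRUE,
                        levels = c("S", "M", "L"))
tshirt_factor[1] < tshirt_factor[2]   # medium is less than large
```

Comparing two elements of blood_factor with < would instead raise the "not meaningful" warning, since it is an unordered factor.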
Learn more about credit risk modeling in R: https://www.datacamp.com/courses/introduction-to-credit-risk-modeling-in-r We have seen several techniques for preprocessing the data. When the data is fully preprocessed, you can go ahead and start your analysis. You can run the model on the entire data set and use the same data set for evaluating the result, but this will most likely lead to a result that is too optimistic. One alternative is to split the data into two pieces. The first part of the data, the so-called training set, can be used for building the model, and the second part of the data, the test set, can be used to test the results. One common way of doing this is to use two-thirds of the data for the training set and one-third of the data for the test set. Of course, there can be a lot of variation in the performance estimate depending on which two-thirds of the data you select for the training set. One way to reduce this variation is by using cross-validation. For the two-thirds training set and one-third test set example, a cross-validation variant would look like this: the data would be split into three equal parts, and each time, two of these parts would act as a training set and one part would act as a test set. Of course, we could use as many parts as we want, but we would have to run the model many times if using many parts. This may become computationally heavy. In this course, we will just use one training set and one test set containing two-thirds versus one-third of the data, respectively. Imagine we have just run a model, and now we apply the model to our test set to see how good the results are. Evaluating the model for credit risk means comparing the observed outcomes of default versus non-default--stored in the loan_status variable of the test set--with the predicted outcomes according to the model. If we are dealing with a large number of predictions, a popular method for summarizing the results uses something called a confusion matrix.
Here, we use just 14 values to demonstrate the concept. A confusion matrix is a contingency table of correct and incorrect classifications. Correct classifications are on the diagonal of the confusion matrix. We see, for example, that 8 non-defaulters were correctly classified as non-default, and 3 defaulters were correctly classified as defaulters. However, we see that 2 non-defaulters were wrongly classified as defaulters, and 1 defaulter was wrongly classified as a non-defaulter. The items on the diagonal are also called the true positives and true negatives; the off-diagonal items are called the false positives and the false negatives. Several measures can be derived from the confusion matrix. We will discuss the classification accuracy, the sensitivity and the specificity. The classification accuracy is the percentage of correctly classified instances, which is equal to 11 out of 14, or 78.57%, in this example. The sensitivity is the percentage of bad customers, the defaulters, that are classified correctly: 3 out of 4, or 75%, in this example. The specificity is the percentage of good customers, the non-defaulters, that are classified correctly: 8 out of 10, or 80%, in this example. Let's practice splitting the data and constructing confusion matrices.
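As a sketch, here is how the confusion matrix and the three measures could be computed in R. The observed and predicted vectors are made up to reproduce the counts in the text (1 = default, 0 = non-default):

```r
# 8 non-defaulters predicted correctly, 3 defaulters predicted correctly,
# 2 non-defaulters wrongly flagged, 1 defaulter missed: 14 loans in total
observed  <- c(rep(0, 8), rep(1, 3), rep(0, 2), rep(1, 1))
predicted <- c(rep(0, 8), rep(1, 3), rep(1, 2), rep(0, 1))

# Contingency table: observed outcomes in rows, predictions in columns
conf_mat <- table(observed, predicted)
conf_mat

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)        # 11 / 14
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # 3 / 4
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])  # 8 / 10
```

diag() picks out the correct classifications, so accuracy is simply the diagonal total over the grand total.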
Learn how to work with conditional statements in R. Join DataCamp today, and start our intermediate R tutorial for free: https://www.datacamp.com/courses/intermediate-r Have a look at the recipe for the if statement: the `if` statement takes a condition; if the condition evaluates to `TRUE`, the R code associated with the if statement is executed. The condition to check appears inside parentheses, while the R code that has to be executed if the condition is `TRUE` follows in curly brackets. Let's have a look at an example. Suppose we have a variable `x` equal to -3. If this `x` is smaller than zero, we want R to print out "x is a negative number!". How can we do this using the `if` statement? We first assign the variable `x`, and then write the `if` test. If we run this bit of code, we indeed see that the string "x is a negative number" gets printed out. However, if we change `x` to 5 and re-run the code, the condition is FALSE, the code is not executed, and the printout does not occur. This brings us to the `else` statement: this conditional statement does not need an explicit condition; instead, it has to be used together with an if statement. The code associated with an else statement gets executed whenever the condition of the `if` test is not satisfied. We can extend our recipe by including an else statement as follows: Returning to our example, suppose we want to print out "x is positive or zero" whenever the condition is not met. We can simply add the else statement: The else if statement comes in between the if and else statements. To see how R deals with these different conditions and corresponding code blocks, let's first extend our example. We want R to print out "x is zero" if `x` equals 0, and to print out "x is a positive number" otherwise. We add the else if, together with a new print statement, and adapt the message we print in the else statement: How does R process this control structure? Let's first go through what happens when x equals -3.
In this case, the condition for the `if` statement evaluates to `TRUE`, so "x is a negative number" gets printed out, and R ignores the rest of the statements. If x equals 0, R first checks the if condition, sees that it is FALSE, and then heads over to the else if condition. This condition, `x == 0`, evaluates to `TRUE`, so "x is zero" gets printed to the console, and R ignores the else statement entirely. Finally, what happens when x equals 5? Well, the `if` condition evaluates to `FALSE`, as does the `else if` condition, so R executes the else statement, printing "x is a positive number". Remember that as soon as R stumbles upon a condition that evaluates to `TRUE`, R executes the corresponding code and then ignores the rest of the control structure. This becomes important if the conditions you list are not mutually exclusive.
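A sketch of the full control structure from the example, plus a small function wrapper (the wrapper is an addition, not from the video) that makes it easy to try all three branches:

```r
x <- -3

if (x < 0) {
  print("x is a negative number")
} else if (x == 0) {
  print("x is zero")
} else {
  print("x is a positive number")
}

# Wrapping the same control structure in a function to exercise
# each branch: -3, 0 and 5 hit the if, else if and else in turn
describe <- function(x) {
  if (x < 0) {
    "x is a negative number"
  } else if (x == 0) {
    "x is zero"
  } else {
    "x is a positive number"
  }
}
describe(0)   # "x is zero"
```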
First video of our first chapter for our Supervised Learning with scikit-learn course by Andreas Mueller. First chapter free: https://www.datacamp.com/courses/supervised-learning-with-scikit-learn At the end of the day, the value of Data Scientists rests on their ability to describe the world and to make predictions. Machine Learning is the field of teaching machines and computers to learn from existing data to make predictions on new data - will a given tumor be benign or malignant? Which of your customers will take their business elsewhere? Is a particular email spam or not? In this course, you'll learn how to use Python to perform supervised learning, an essential component of Machine Learning. You'll learn how to build predictive models, how to tune their parameters and how to tell how well they will perform on unseen data, all the while using real world datasets. You'll do so using scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.
Learn more about cleaning data with R: https://www.datacamp.com/courses/cleaning-data-in-r The first step in the data cleaning process is exploring your raw data. We can think of data exploration itself as a three-step process consisting of understanding the structure of your data, looking at your data, and visualizing your data. To understand the structure of your data, you have several tools at your disposal in R. Here, we read in a simple dataset called lunch, which contains information on the number of free, reduced price, and full price school lunches served in the US from 1969 through 2014. First, we check the class of the lunch object to verify that it's a data frame, or a two-dimensional table consisting of rows and columns, in which each column is a single data type such as numeric, character, etc. We then view the dimensions of the dataset with the dim() function. This particular dataset has 46 rows and 7 columns. dim() always displays the number of rows first, followed by the number of columns. Next, we take a look at the column names of lunch with the names() function. Each of the 7 columns has a name: year, avg_free, avg_reduced, and so on. Okay, so we're starting to get a feel for things, but let's dig deeper. The str() (for "structure") function is one of the most versatile and useful functions in the R language, because it can be called on any object and will normally provide a useful and compact summary of its internal structure. When passed a data frame, as in this case, str() tells us how many rows and columns we have. Actually, the function refers to rows as observations and columns as variables, which, strictly speaking, is true in a tidy dataset, but not always the case, as you'll see in the next chapter. In addition, you see the name of each column, followed by its data type and a preview of the data contained in it. The lunch dataset happens to be entirely integers and numerics. We'll have a closer look at these data types in chapter 3.
The dplyr package offers a slightly different flavor of str() called glimpse(), which offers the same information but attempts to preview as much of each column as will fit neatly on your screen. So here, we first load dplyr with the library() command, then call glimpse() with a single argument, lunch. Another extremely helpful function is summary(), which, when applied to a data frame, provides a useful summary of each column. Since the lunch data are entirely integers and numerics, we see a summary of the distribution of each column, including the minimum and maximum, the mean, and the 25th, 50th, and 75th percentiles (also referred to as the first quartile, median, and third quartile, respectively). As you'll soon see, when faced with character or factor variables, summary() will produce different summaries. To review, you've seen how we can use the class() function to see the class of a dataset, the dim() function to view its dimensions, names() to see the column names, str() to view its structure, glimpse() to do the same in a slightly enhanced format, and summary() to see a helpful summary of each column. Time to practice!
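A sketch of this exploration workflow on a small made-up stand-in for the lunch data (the real dataset has 46 rows and 7 columns; the values below are invented):

```r
# Stand-in for the lunch data: year plus two of its columns
lunch <- data.frame(year = 1969:1973,
                    avg_free = c(2.9, 4.6, 5.8, 7.3, 8.1),
                    avg_reduced = c(0.0, 0.0, 0.5, 0.5, 0.5))

class(lunch)    # "data.frame"
dim(lunch)      # rows first, then columns: 5 3
names(lunch)    # "year" "avg_free" "avg_reduced"
str(lunch)      # compact structure summary
summary(lunch)  # per-column distribution summary

# With dplyr installed, glimpse(lunch) gives a similar,
# screen-width-aware preview:
# library(dplyr); glimpse(lunch)
```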
In this introduction to R course you will learn about the basics of R, as well as the most common data structures it uses to store data Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r Your R skills are growing by the minute, I can feel it! The most important data structures that we've covered up to now are vectors and matrices. Remember that the vector is a one-dimensional array that can only hold elements of the same type. Similarly, matrices can only hold elements of the same type, but this time they're in a two-dimensional array. Vectors and matrices are great, but there are cases in which you want to store different data types in the same data structure. This is where lists come in. A list can contain all kinds of R objects, such as vectors and matrices, but also other R objects, such as dates, data frames, factors and many more. All of this can be stored in a single list without R having to perform coercion to enforce the same type. That's pretty cool, right? Because lists can contain practically anything you can think of in R terms, you do lose some functionality that vectors and matrices offered. Most importantly, performing calculations with lists is far less straightforward, because there's no predefined structure that lists have to follow. Enough theory, let's build some lists! Suppose that as a music artist on the rise to fame and fortune, you regularly record some new songs, and keep some details for each of these songs. Your latest creation is called "Rsome times", is 190 seconds long and should be the 5th number on your record. Trying to store this information in a vector using the `c()` function inevitably leads to coercion. However, you can also store this information in a list, using the `list()` function. This time, all the elements have kept their original type. The printout is pretty different from what you're used to.
We can see that the first element in the list is the string "Rsome times", for example, which actually is a character vector. To continue working with this list, we'll store it in a new variable, `song`: We can assert that this `song` variable is a list using the is.list() function: Now, storing the song information like this, without any names, is not really clear, so let's assign some labels with the names() function. To assign the names, you still use a character vector, even though we're working with lists now: Printing `song` again shows that the indices in double square brackets have changed to the names of the list elements; this looks much nicer! As was the case with vectors, you can also directly specify the names in a list at the time of creation. To create the exact same variable `song`, you can use this command: You've already figured out that the standard way of printing the contents of a list is pretty bulky. I suggest you use the `str()` function for this: this function compactly displays the structure of the `song` list. As I told you before, lists can contain practically anything. They can even contain other lists! Suppose you want to add a list containing the title and duration of a very similar but less catchy song that you've also recorded in recent weeks. Let's first create a list to contain this information, `similar_song`. We can now create the `song` list again, as follows: The structure of this list reveals that it is perfectly possible to store lists inside lists. If you want to go totally crazy, you can even store a list inside a list and then store that list in another list, but let's not make ourselves dizzy here. It's time to work your way around some of the interactive exercises before I introduce some techniques to subset and extend lists. Have fun!
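A sketch of the song list described above; the element names and the similar song's title and duration are illustrative assumptions:

```r
# A vector coerces everything to character
c("Rsome times", 190, 5)

# A list keeps each element's original type; names set at creation
song <- list(title = "Rsome times", duration = 190, track = 5)
is.list(song)   # TRUE
song$duration   # 190, still numeric

# Lists can contain other lists
similar_song <- list(title = "R you on time?", duration = 230)
song <- list(title = "Rsome times", duration = 190, track = 5,
             similar = similar_song)
str(song)       # compact display of the nested structure
```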
Learn the basics of Machine Learning with R. Start our Machine Learning Course for free: https://www.datacamp.com/courses/introduction-to-machine-learning-with-R First up is Classification. A *classification problem* involves predicting whether a given observation belongs to one of two or more categories. The simplest case of classification is called binary classification: it has to decide between two categories, or classes. Remember how I compared machine learning to the estimation of a function? Well, based on earlier observations of how the input maps to the output, classification tries to estimate a classifier that can generate an output for an arbitrary input observation. We say that the classifier labels an unseen example with a class. The possible applications of classification are very broad. For example, after a set of clinical examinations that relate vital signals to a disease, you could predict whether a new patient with an unseen set of vital signals suffers from that disease and needs further treatment. Another totally different example is classifying a set of animal images into cats, dogs and horses, given that you have trained your model on a bunch of images for which you know what animal they depict. Can you think of a possible classification problem yourself? What's important here is that, first off, the output is qualitative, and second, that the classes to which new observations can belong are known beforehand. In the first example I mentioned, the classes are "sick" and "not sick". In the second example, the classes are "cat", "dog" and "horse". In chapter 3 we will do a deeper analysis of classification and you'll get to work with some fancy classifiers! Moving on ... A **Regression problem** is a kind of Machine Learning problem that tries to predict a continuous or quantitative value for an input, based on previous information. The input variables are called the predictors, and the output is called the response.
In some sense, regression is pretty similar to classification. You're also trying to estimate a function that maps input to output based on earlier observations, but this time you're trying to estimate an actual value, not just the class of an observation. Do you remember the example from the last video? There we had a dataset on a group of people's heights and weights. A valid question could be: is there a linear relationship between these two? That is, will a change in weight correlate linearly with a change in height? If so, can you describe it, and can you predict the height of a new person given their weight? These questions can be answered with linear regression! Linear regression models the response as a linear function of the predictor: height = \beta_0 + \beta_1 * weight. Together, \beta_0 and \beta_1 are known as the model coefficients or parameters. As soon as you know the coefficients beta 0 and beta 1, the function is able to convert any new input to output. This means that solving your machine learning problem is actually finding good values for beta 0 and beta 1. These are estimated based on previous input-to-output observations. I will not go into details on how to compute these coefficients; the function `lm()` does this for you in R. Now, I hear you asking: what can regression be useful for, apart from some silly weight and height problems? Well, there are many different applications of regression, going from modeling credit scores based on past payments, finding the trend in your YouTube subscriptions over time, or even estimating your chances of landing a job at your favorite company based on your college grades. All these problems have two things in common. First off, the response, or the thing you're trying to predict, is always quantitative. Second, you will always need knowledge of previous input-output observations in order to build your model. The fourth chapter of this course will be devoted to a more comprehensive overview of regression. So... Classification: check. Regression: check. Last but not least, there is clustering.
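As a sketch, here is lm() estimating beta 0 and beta 1 for the height-from-weight question; the six weight/height observations are made up for illustration:

```r
# Made-up observations: weight in kg, height in cm
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(173, 175, 168, 189, 192, 180)

# Estimate beta_0 and beta_1 in height = beta_0 + beta_1 * weight
fit <- lm(height ~ weight)
coef(fit)   # intercept (beta_0) and slope (beta_1)

# Predict the height of a new person weighing 80 kg
predict(fit, data.frame(weight = 80))
```

With the coefficients in hand, any new weight can be converted to a predicted height, which is exactly the "solving the problem is finding beta 0 and beta 1" point above.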
In clustering, you're trying to group objects that are similar, while making sure the clusters themselves are dissimilar. You can think of it as classification, but without saying to which classes the observations have to belong or how many classes there are. Take the animal photos, for example. In the case of classification, you had information about the actual animals that were depicted. In the case of clustering, you don't know what animals are depicted; you would simply get a set of pictures. The clustering algorithm then simply groups similar photos in clusters. You could say that clustering is different in the sense that you don't need any knowledge about the labels. Moreover, there is no right or wrong in clustering. Different clusterings can reveal different and useful information about your objects. This makes it quite different from both classification and regression, where there always is a notion of prior expectation or knowledge of the result.
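As a sketch of the idea, here is base R's kmeans() grouping made-up 2-D points into two clusters, with no labels supplied at any point:

```r
# Two made-up groups of 10 points each, centered at 0 and at 5
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

# Ask k-means for 2 clusters; only the data and the number of
# clusters go in, never any class labels
km <- kmeans(x, centers = 2)
km$cluster   # cluster membership for each of the 20 points
```

Note that the cluster numbers themselves are arbitrary; a different run could swap 1 and 2, which is part of the "no right or wrong" point above.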
Learn more about connecting to databases with R: https://www.datacamp.com/courses/importing-data-in-r-part-2 Welcome to part two of importing data in R! The previous course dealt with accessing data stored in flat files or Excel files. In a professional setting, you'll also encounter data stored in relational databases. In this video, I'll briefly talk about what a relational database is, and then I'll explain how you can connect to it. In the next video, I'll explain how you can import data from it! So, what's a relational database? There's no better way to show this than with an example. Take this database, called company. It contains three tables: employees, products and sales. Like a flat file, information is displayed in a table format. The employees table has 5 records and three fields, namely id, name and started_at. The id here serves as a unique key for each row or record. Next, the products table contains the details on four products. We're dealing with data from a telecom company that's selling both with and without a contract. Here too, each product has an identifier. Finally, there's the sales table. It lists what products were sold by whom, when, and for what price. Notice here that the ids in employee_id and product_id correspond to the ids that you can find in the employees and products tables, respectively. The third sale, for example, was done by the employee with id 6, so Julie. She sold the product with id 9, so the Biz Unlimited contract. These relations make this database very powerful. You store all necessary information only once, in nicely separated tables, but you can connect the dots between different records to model what's happening. How the data in a relational database is stored and shuffled around when you make adaptations depends on the so-called database management system, or DBMS, you're using.
Open-source implementations such as MySQL, PostgreSQL and SQLite are very popular, but there are also proprietary implementations such as Oracle Database and Microsoft SQL Server. Practically all of these implementations use SQL, or "sequel", as the language for querying and maintaining the database. SQL stands for Structured Query Language. Depending on the type of database you want to connect to, you'll have to use different packages. Suppose the company database I introduced before is a MySQL database. This means you'll need the RMySQL package. For PostgreSQL you'll need RPostgreSQL, for Oracle you'll use ROracle, and so on. How you interact with the database, so which R functions you use to access and manipulate the database, is specified in another R package called DBI. In more technical terms, DBI is an interface, and RMySQL is the implementation. Let's install the RMySQL package, which automatically installs the DBI package as well. Loading only the DBI package will be enough to get started. The first step is creating a connection to the remote MySQL database. You do this with dbConnect(), as follows. The first argument specifies the driver that you will use to connect to the MySQL database. It sure looks a bit strange, but the MySQL() function from the RMySQL package simply constructs a driver for us that dbConnect() can use. Next, you have to specify the database name, where the database is hosted, through which port you want to connect, and finally the credentials to authenticate yourself. This is an actual database that we're hosting, so you can try these commands yourself! The result of the dbConnect() call, con, is a DBI connection object. You'll need to pass this object to whatever function you're using to interact with the database. Before we do that, let's get familiar with this connection object in the exercises!
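A connection sketch matching the steps above. The host, user and password here are placeholders, not the course's actual credentials; substitute the details of the database you want to reach (it requires the RMySQL and DBI packages and a reachable server, so it is shown as a configuration template rather than run here):

```r
# install.packages("RMySQL")  # also pulls in DBI
library(DBI)

# MySQL() from RMySQL builds the driver object dbConnect() needs;
# all connection details below are placeholder values
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "company",
                 host = "database.example.com",
                 port = 3306,
                 user = "student",
                 password = "secret")

# con is a DBI connection object: pass it to the DBI functions you
# use to query the database, and close it when you're done
# dbDisconnect(con)
```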
Explore how you can subset, extend and sort your data frames in R. Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r The data frame sits somewhere at the intersection of matrices and lists. To subset a data frame, you can thus use subsetting syntax from both matrices and lists. On the one hand, you can use the single brackets from matrix subsetting, while you can also use the double brackets and dollar sign notation that you use to select list elements. We'll continue with the data frame that contained some information on 5 persons. Have another look at its definition here. Let's start with selecting single elements from a data frame. To select the age of Frank, who is on row 3 in the data frame, you can use the exact same syntax as for matrix subsetting: single brackets with two indices inside. The row, index 3, comes first, and the column, index 2, comes second: Indeed, Frank is 21 years old. You can also use the column names to refer to the columns, of course: Just as for matrices, you can also choose to omit one of the two indices or names to end up with an entire row or an entire column. If you want to have all information on Frank, you can use this command: The result is a data frame with a single observation, because there has to be a way to store the different types. On the other hand, to get the entire age column, you could use this command: Here, the result is a vector, because columns contain elements of the same type. Subsetting the data frame to end up with a sub data frame that contains multiple observations also works just as you'd expect. Have a look at this command, that selects the age and parenting information on Frank and Cath: All of these examples show that you can subset data frames exactly as you did with matrices. The only difference occurs when you specify only one index inside `people`.
In the matrix case, R would go through each column from left to right to find the index you specified. In the data frame case, you simply end up with a new data frame that only contains the column you specified. This command, for example, gives the age column as a data.frame. I repeat: a data.frame, not a vector! Why so? Let me talk about subsetting data.frames with list syntax and it'll all become clear. Remember when I told you that a data frame is actually a list of vectors of the same length? This means that you can also use the list syntax to select elements. Say, for example, you typed people dollar sign age: The age vector inside the data frame gets returned, so you end up with the age column. Likewise, you can use the double brackets notation with a name ... or with an index. In all cases, the result is a vector. You can also use single brackets to subset lists, but this generates a new list, containing only the specified elements. Take this command for example: The result is still a data frame, which is a list, but this time containing only the "age" element. This explains why before, this command gave a data frame instead of a vector. Again, using single brackets or double brackets to subset data structures can have serious consequences, so always think about what you're dealing with and how you should handle it. Once you know how to correctly subset data frames, extending those data frames is pretty simple. Sometimes, you'll want to add a column, a new variable, to your data frame. Other times, it's also useful to add new rows, so new observations, to your data frame. To add a column, which actually comes down to adding a new element to the list, you can use the dollar sign or the double square brackets. Suppose you want to add a column `height`, the information of which is already in a vector `height`. This call ... Or this call ... Will do the trick. You can also use the `cbind()` function that you've learned to build and extend matrices.
It works just the same for data.frames. To add a weight column, in kilograms, for example. If `cbind()` works, then surely `rbind()` will work fine as well. Indeed, you can use `rbind()` to add new rows to your observations. Suppose you want to add the information of another person, Tom, to the data frame. Simply creating a vector with the name, age, height, etc., won't work, because a vector can't contain elements of different types. You'll have to create a new data frame containing only a single observation, and add that to the data frame using rbind(). Let's call this mini data frame `tom`. Now, we can use `rbind()` to bind `people` and `tom` together: Wait, what? R throws an error: names do not match previous names. This means that the names in `people` and `tom` do not match. We'll have to improve our definition of `tom` to make the merge successful: Now, `rbind()` will work as you'd want it to work. So adding a column to a data frame is pretty easy, but adding new observations requires some care.
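To make the subsetting and extending rules concrete, here is a small self-contained sketch with a hypothetical `people` data frame (the names and values are made up, standing in for the one used in the video):

```r
# Hypothetical stand-in for the people data frame from the video.
people <- data.frame(name   = c("Anna", "Bert", "Frank", "Dina", "Cath"),
                     age    = c(28, 42, 21, 35, 29),
                     parent = c(FALSE, TRUE, FALSE, TRUE, TRUE),
                     stringsAsFactors = FALSE)

people[3, 2]       # matrix style: Frank's age, a single value
people[3, ]        # an entire row: still a data frame (mixed types)
people[, "age"]    # an entire column: a vector
people$age         # list style: the age vector
people[["age"]]    # double brackets: also the age vector
people["age"]      # single brackets: a one-column data frame, NOT a vector

# Extending: add a column via list syntax ...
people$height <- c(163, 177, 163, 158, 170)
# ... and add a row by rbind()-ing a one-row data frame with MATCHING names.
tom <- data.frame(name = "Tom", age = 37, parent = FALSE, height = 178,
                  stringsAsFactors = FALSE)
people <- rbind(people, tom)
```

Note how `people["age"]` and `people$age` return different structures — exactly the single- versus double-bracket distinction discussed above.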
With so many people making resolutions for the new year, we thought it was only right for DataCamp to make a few resolutions of its own. Jonathan Cornelissen, one of the founders and CEO of DataCamp, will tell you more about DataCamp's New Year's resolutions for 2018. Start learning today: https://www.datacamp.com
This is the first video of chapter 1 of Network Analysis by Eric Ma. Take Eric's course: https://www.datacamp.com/courses/network-analysis-in-python-part-1 From online social networks such as Facebook and Twitter to transportation networks such as bike sharing systems, networks are everywhere, and knowing how to analyze this type of data will open up a new world of possibilities for you as a Data Scientist. This course will equip you with the skills to analyze, visualize, and make sense of networks. You'll apply the concepts you learn to real-world network data using the powerful NetworkX library. With the knowledge gained in this course, you'll develop your network thinking skills and be able to start looking at your data with a fresh perspective! Transcript: Hi! My name is Eric, and I am a Data Scientist working at the intersection of biological network science and infectious disease, and I'm thrilled to share with you my knowledge on how to do network analytics. I hope we'll have a fun time together! Let me first ask you a question: what are some examples of networks? Well, one example might be a social network! In a social network, we are modelling the relationships between people. Here’s another one - transportation networks. In a transportation network, we are modelling the connectivity between locations, as determined by roads or flight paths connecting them. At their core, networks are a useful tool for modelling relationships between entities. By modelling your data as a network, you can end up gaining insight into what entities (or nodes) are important, such as broadcasters or influencers in a social network. Additionally, you can start to think about optimizing transportation between cities. Finally, you can leverage the network structure to find communities in the network. Let’s go a bit more technical. Networks are described by two sets of items: nodes and edges. Together, these form a “network”, otherwise known in mathematical terms as a “graph”.
Nodes and edges can have metadata associated with them. For example, let’s say there are two friends, Hugo and myself, who met on the 21st of May, 2016. In this case, the nodes may be “Hugo” and myself, with metadata stored in a key-value pair as “id” and “age”. The friendship is represented as a line between the two nodes, and may have metadata such as “date”, which represents the date on which we first met. In the Python world, there is a library called NetworkX that allows us to manipulate, analyze and model graph data. Let’s see how we can use the NetworkX API to analyze graph data in memory. NetworkX is typically imported as nx. Using nx.Graph(), we can initialize an empty graph to which we can add nodes and edges. I can add in the integers 1, 2, and 3 as nodes, using the add_nodes_from() method, passing in the list [1, 2, 3] as an argument. The Graph object G has a .nodes() method that allows us to see what nodes are present in the graph, and returns a list of nodes. If we add an edge between the nodes 1 and 2, we can then use the G.edges() method to return a list of tuples which represent the edges, in which each tuple shows the nodes that are present on that edge. Metadata can be stored on the graph as well. For example, I can add to the node ‘1’ a ‘label’ key with the value ‘blue’, just as I would assign a value to the key of a dictionary. I can then retrieve the node list with the metadata attached using G.nodes(), passing in the data=True argument. What this returns is a list of 2-tuples, in which the first element of each tuple is the node, and the second element is a dictionary in which the key-value pairs correspond to my metadata. NetworkX also provides basic drawing functionality, using the nx.draw() function. nx.draw() takes in a graph G as an argument. In the IPython shell, you will also have to call the plt.show() function in order to display the graph to screen. 
With this graph, the nx.draw() function will draw to screen what we call a node-link diagram rendering of the graph. The first set of exercises we’ll be doing here is essentially exploratory data analysis on graphs. Alright, let’s go on and take a look at the exercises!
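The API calls described above can be sketched in a few lines. This is a minimal, self-contained example using the NetworkX 2.x syntax for node metadata:

```python
# A minimal sketch of the NetworkX calls described above.
import networkx as nx

G = nx.Graph()                  # initialize an empty graph
G.add_nodes_from([1, 2, 3])     # add nodes from a list
G.add_edge(1, 2)                # add a single edge

# Attach metadata to node 1, just like assigning into a dictionary
# (NetworkX 2.x syntax).
G.nodes[1]["label"] = "blue"

print(list(G.nodes()))           # [1, 2, 3]
print(list(G.edges()))           # [(1, 2)]
print(list(G.nodes(data=True)))  # [(1, {'label': 'blue'}), (2, {}), (3, {})]
```

From here, `nx.draw(G)` followed by `plt.show()` renders the node-link diagram described in the video.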
Learn more about cleaning data in R: https://www.datacamp.com/courses/cleaning-data-in-r Okay, so we've seen some useful summaries of our data, but there's no substitute for just looking at it. The head() function shows us the first 6 rows by default. If you add one additional argument, n, you can control how many rows to display. For example, head(lunch, n = 15) will display the first 15 rows of the data. We can also view the bottom of lunch with the tail() function, which displays the last 6 rows by default, but that behavior can be altered in the same way with the n argument. Viewing the top and bottom of your data only gets you so far. Sometimes the easiest way to identify issues with the data is to plot them. Here, we use hist() to plot a histogram of the percent free and reduced lunch column, which quickly gives us a sense of the distribution of this variable. It looks like the value of this variable falls between 50 and 60 for 20 out of the 46 years contained in the lunch dataset. Finally, we can produce a scatter plot with the plot() function to look at the relationship between two variables. In this case, we clearly see that the percent of lunches that are either free or reduced price has been steadily rising over the years, going from roughly 15 to 70 percent between 1969 and 2014. To review, head() and tail() can be used to view the top and bottom of your data, respectively. Of course, you can also just print() your data to the console, which may be okay when working with small datasets like lunch, but is definitely not recommended when working with larger datasets. Lastly, hist() will show you a histogram of a single variable and plot() can be used to produce a scatter plot showing the relationship between two variables. Time to practice!
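These inspection tools are all base R, so they are easy to try out. Here is a self-contained sketch using a made-up stand-in for the lunch dataset (the real one tracks the percent of free/reduced lunches from 1969 to 2014):

```r
# Hypothetical stand-in for the lunch dataset (values are made up).
lunch <- data.frame(year          = 1969:2014,
                    perc_free_red = seq(15, 70, length.out = 46))

head(lunch)           # first 6 rows (the default)
head(lunch, n = 15)   # first 15 rows
tail(lunch)           # last 6 rows (tail also accepts an n argument)

hist(lunch$perc_free_red)              # distribution of a single variable
plot(lunch$year, lunch$perc_free_red)  # relationship between two variables
```

In an interactive session, the last two calls open a graphics device; in a script they write to the default plot file.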
Introduction to what exactly a data.table is and how it differs from the traditional data.frame in R. Start the interactive data.table course by DataCamp for free at https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial The data.table package is rapidly making its name as the number one choice for handling large datasets in R. This course will bring you from data.table novice to data.table expert.
Learn more about Multiple Groups and Variables in ggplot2: https://www.datacamp.com/courses/data-visualization-with-ggplot2-part-3 To wrap up our discussion of statistical plots, let’s see how we can use them when comparing multiple groups or variables. Let’s begin with groups, by which I mean levels within a factor variable - in this case, it’s the eating habits of different mammals. The distribution that we are interested in is the amount of total sleep time experienced by each mammal. Up until this point we would have used a plot like this, which is just jittered points. But we’ve seen that we can also use box plots. So that’s pretty straightforward. I should point out that although we could use box plots, in this case it’s not really reasonable, since the insectivore group only has 5 observations. A problem with box plots is that they don’t show information about the number of observations. We can remedy this problem by setting the width of each box relative to the n value for each group. Density plots could work in this situation. The advantage here is that we can overlay multiple density plots on top of each other, so we can compare distributions more easily, which is pretty nice. However, we once again lose information about the group size, since it appears that the insectivore group, the blue curve, is very abundant. To correct for this we can weight each density curve according to the proportional number of observations of each group. The resulting plot shows that herbivores are the most abundant group, and there are very few observations in insectivores. If we wanted to see multiple density plots side-by-side, we could facet our plot, but there is another alternative. The violin plot is a relatively new plot type which is gaining in popularity. The violin plot basically puts a density plot onto a vertical axis and then mirrors it to create a symmetrical two-dimensional shape. This can really aid in comparing different distributions.
Just like with the regular density plot, we should also consider weighting each group according to its n value. We once again see that insectivores are not very abundant. With these plots we can compare many groups within a variable. The other type of comparison I mentioned was to compare separate variables. For that let’s take a look at a classic example, the eruption time and waiting duration at the Old Faithful Geyser in Yellowstone National Park. At the outset it appears as if the main relationship between these two variables is linear, which would be correct, but more subtly than that, the data is also bimodal on both axes. That is, you either wait a long time and get a long eruption, or you wait a short time and get a short eruption. There are relatively few data points in-between. For this, we can use a 2d density plot, which appears as something like a contour plot. If you’ve ever seen a topographical map, the concept is the same. The more concentric a ring is, the higher the density. A nice effect here is to fill in the regions according to their density. We encountered monochromatic colour scales in the first two courses, which I advocated for in the case of continuous data. However, the viridis colour scale has recently gained in popularity and we'll explore some advantages of this scale in the exercises. A two-dimensional density plot emphasises the bimodal nature of this data set, so sometimes it can be quite useful to consider distributions in two dimensions. We’ll see density plots make a reappearance when we talk about ternary plots in the next chapter, where we have three variables. Another advantage of the ggplot2 structure is that we can use the underlying statistics with a different geom, so instead of producing a contour or filled density plot, we can calculate the density by representing the values using a grid of circles, whose size varies according to the underlying density.
OK, that's enough discussion for now, let's take a closer look at 2-dimensional density plots in the exercises!
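As a sketch of the two plot types discussed above, here is a self-contained example using ggplot2 (assumed installed) with two built-in datasets — mpg and faithful — as stand-ins for the mammals sleep data:

```r
library(ggplot2)

# Violin plot: a density mirrored around a vertical axis, one per group
# (mpg's car classes stand in for the mammals' eating habits).
p_violin <- ggplot(mpg, aes(class, hwy)) +
  geom_violin()

# 2D density contours of the Old Faithful data: like a topographic map,
# tighter rings mean higher density.
p_density <- ggplot(faithful, aes(waiting, eruptions)) +
  geom_point(alpha = 0.3) +
  geom_density_2d()

p_violin
p_density
```

The bimodal structure of the faithful data — long wait/long eruption versus short wait/short eruption — shows up as two separate clusters of contour rings.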
Learn more about text mining with R: https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words Now that you have a corpus, you have to take it from the unorganized raw state and start to clean it up. We will focus on some common preprocessing functions. But before we actually apply them to the corpus, let’s learn what each one does, because you don’t always apply the same ones for all your analyses. Base R has a function tolower(). It makes all the characters in a string lowercase. This is helpful for term aggregation but can be harmful if you are trying to identify proper nouns like cities. The removePunctuation function...well, it removes punctuation. This can be especially helpful in social media but can be harmful if you are trying to find emoticons made of punctuation marks like a smiley face. Depending on your analysis you may want to remove numbers. Obviously don’t do this if you are trying to text mine quantities or currency amounts, but removeNumbers may be useful sometimes. The stripWhitespace function is also very useful. Sometimes text has extra tabbed whitespace or extra lines. This simply removes it. A very important function from tm is removeWords. You can probably guess that a lot of words like "the" and "of" are not very interesting, so they may need to be removed. All of these transformations are applied to the corpus using the tm_map function. This text mining function is an interface to transform your corpus through a mapping to the corpus content. You see here that tm_map takes a corpus, then one of the preprocessing functions like removeNumbers or removePunctuation, to transform the corpus. If the transforming function is not from the tm library, it has to be wrapped in the content_transformer function. Doing this tells tm_map to import the function and use it on the content of the corpus. The stemDocument function uses an algorithm to reduce words to their base.
In this example, you can see "complicatedly", "complicated" and "complication" all get stemmed to "complic". This definitely helps aggregate terms. The problem is that you are often left with tokens that are not words! So you have to take an additional step to complete the base tokens. The stemCompletion function takes as arguments the stemmed words and a dictionary of complete words. In this example, the dictionary is only "complicate", but you can see how all three words were unified to "complicate". You can even use a corpus as your completion dictionary as shown here. There is another whole group of preprocessing functions from the qdap package which can complement these nicely. In the exercises, you will have the opportunity to work with both tm and qdap preprocessing functions, then apply them to a corpus.
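The whole preprocessing pipeline can be sketched as follows, with the tm package assumed installed and two made-up documents standing in for a real corpus:

```r
library(tm)

# Two hypothetical documents with messy casing, punctuation, and whitespace.
docs <- c("Text MINING is fun!!", "We have 10  reasons to   clean text.")
corpus <- VCorpus(VectorSource(docs))

# tolower() is base R, not tm, so it must be wrapped in content_transformer().
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

content(corpus[[1]])  # inspect the cleaned first document
```

Note the order matters: stripping whitespace before removing stop words, for example, can leave fresh double spaces behind, so you may want stripWhitespace as a final pass in your own analyses.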
Learn how to connect to your database: https://www.datacamp.com/courses/introduction-to-relational-databases-in-python In the Python world, there are several great tools that we can use when working with databases. One of those is SQLAlchemy, which we will be using throughout this course. SQLAlchemy will allow us to generate SQL queries by writing Python code. You should still consider learning how to write queries in SQL as well. SQLAlchemy has two main components. The part we will be focusing on is often referred to as the "core" part of SQLAlchemy. It's really focused around the relational model of the database. Additionally, there is the Object Relational Model, or ORM, part of SQLAlchemy that is really focused around data models and classes that you as a programmer create. There are many different types of databases, and each database type has its own quirks and unique capabilities. You'll commonly find SQLite, PostgreSQL, MySQL, Microsoft SQL Server, and Oracle when working with data. SQLAlchemy provides a way to operate across all of these database types in a consistent manner. To connect to a database, we need a way to talk to it, and an engine provides that common interface. To create an engine, we import the create_engine function from sqlalchemy; we then use the create_engine function and supply it a connection string that provides the details needed to connect to a database. Finally, once we have an engine, we are ready to make a connection using the connect method on the engine. It's worth noting that SQLAlchemy won't actually make the connection until we give it some work to execute. So to review, an engine is the common interface to the database, which requires a connection string to provide the details used to find and connect to the database. Before we go any further, let's talk a bit more about connection strings. In their simplest form, they tell us what kind of database we are talking to and how we should access it.
In this example, you can see that we are using the sqlite database driver and the database file named census_nyc.sqlite, which is in the current directory. Now that we have an engine and a connection, we need to know what tables are in the database. We'll start again by importing the create_engine function and creating an engine to our database. Finally, we can use the table_names method of the engine, which returns a list of tables. Once we know what table we want to work on, we need a way to access that table with Python. To do that we are going to use a handy process called reflection, which reads the database and builds a Table object that we can use in our code. We already have created our engine, so we begin by importing the MetaData and Table objects needed for reflection. The MetaData object is a catalog that stores database information such as tables so we don't have to keep looking them up. To reflect the table, we initialize a MetaData object. Next, we use the SQLAlchemy Table object and provide the table name we got earlier from the table_names method. We also supply our metadata instance, and then instruct it to autoload the table using the engine. Finally, we can use the function repr to view the details of our table that we stored as census. This allows us to see the names of the columns, such as 'state' and 'sex', along with their types, such as VARCHAR and INTEGER. This process of reflection may seem like a bit of overhead, but it will make understanding your databases and extracting information from them far easier downstream. Now it's your turn to practice writing connection strings, connecting to databases and reflecting tables. Then we'll be back here writing our first SQL queries.
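The engine and reflection steps can be sketched end-to-end. This example is written against SQLAlchemy 1.4+, where `autoload_with=engine` replaces the older `autoload=True` flag shown in the video, and it builds its own in-memory SQLite table (a hypothetical stand-in for census_nyc.sqlite) so the reflection step has something to find:

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine)

# Setup: an in-memory SQLite database with a small census-like table.
engine = create_engine("sqlite:///:memory:")
setup_meta = MetaData()
Table("census", setup_meta,
      Column("state", String), Column("sex", String),
      Column("pop2008", Integer))
setup_meta.create_all(engine)

# Reflection: read the database itself to build a Table object we can use.
metadata = MetaData()
census = Table("census", metadata, autoload_with=engine)

print(repr(census))                      # table details: columns and types
print([c.name for c in census.columns])  # ['state', 'sex', 'pop2008']
```

The reflected `census` object can now be used to build queries without ever writing the column definitions by hand.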
Learn more about joins and data manipulation: https://www.datacamp.com/courses/joining-data-in-r-with-dplyr The basic join function in dplyr is left_join(). You can use it whenever you want to augment a data frame with information from another data frame. To use left_join(), first pass it the name of the data frame that you want to augment. This will be your primary table. Here we will augment name by adding to it information from plays. Next, pass left_join() the name of the data frame that you want to augment the first with. This will be the secondary table. In our case, that will be plays. Finally, give left_join() a by argument. You should set by to the name of the key to join on, as a character string. Our key is "name", surrounded by quotation marks. When you click run, left_join() will return a new data frame that contains all of the rows in the first data frame in their original order. Appended to these rows will be new values and columns matched from the second data frame. If a row in the first data frame doesn't have a match, dplyr will supply an NA in the appropriate place, as it did for Mick. If a row in the second data frame doesn't have a match, dplyr will ignore it entirely. It won't appear in the new data frame, as with Keith. What if your key involves more than one variable? For example, what if we want to join our surname data sets? That's easy. To use a multi-column key, pass by a character vector that includes all of the column names in the key. Here we can set by to the character vector "name", "surname". Altogether, dplyr contains six join functions. Each uses the same syntax as left_join() but returns a slightly different result. For example, right_join() is another dplyr function that does the exact opposite of left_join(). right_join() treats the second data set as the primary data set. As a result, it returns a data frame that contains all of the rows of the second data set, augmented where appropriate with information from the first data set.
Here, Keith is retained because he is in the second data set, but Mick is not because he is in the first. In the next video, we will look at more variations on joins, but first we will practice with the basic syntax. Before we do though, I'd like to clarify a term that will come up throughout the course. dplyr's two-table functions are designed to work with data frames, which are the basic structure for storing multi-type tabular data in R. However, dplyr's functions also work with tibbles and dplyr connection objects. You can pass each of these structures into a dplyr function, and it will return the same type of structure as the result. A tibble is a data frame that has been enhanced with an extra class. Tibbles behave like data frames in every way except one. When you display a tibble at the command line, R will show you only the portion of the tibble that fits in your console window. For example, here is how R displays the contents of mtcars, which is a data frame. And here's how R displays a tibble version of the same data set. The idea behind tibbles is that it is easier to inspect a small portion of a data set than to inspect the entire data set at once. You can always see the entire tibble with R's View() command. As you will see, many dplyr functions return tibbles. You can think of them as data frames. To learn more about tibbles, check out the tibble package. A dplyr connection object is a dplyr object that references a table stored outside of R. We don't need to worry about them here. I just want to make it clear that when I talk about dplyr functions working with data frames, data tables, or data sets, I mean all three of these objects.
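The left_join()/right_join() behavior can be sketched with two small hypothetical data frames (dplyr assumed installed; the data is made up to echo the Mick/Keith example):

```r
library(dplyr)

members <- data.frame(name = c("Mick", "John", "Paul"),
                      band = c("Stones", "Beatles", "Beatles"))
plays   <- data.frame(name  = c("John", "Paul", "Keith"),
                      plays = c("guitar", "bass", "guitar"))

joined <- left_join(members, plays, by = "name")
joined  # Mick gets NA for plays (no match); Keith is dropped entirely

right_join(members, plays, by = "name")  # now Keith is kept, Mick is dropped
```

For a multi-column key, the only change is the by argument, e.g. `by = c("name", "surname")`.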
Learn how to make custom plots in Python with matplotlib: https://datacamp.com/courses/intermediate-python-for-data-science Creating a plot is one thing. Making the correct plot, that makes the message very clear, is the real challenge. For each visualization, you have many options. First of all, there are the different plot types. And for each plot, you can do an infinite number of customizations. You can change colors, shapes, labels, axes, and so on. The choice depends on, one, the data, and two, the story you want to tell with this data. Since there are so many possible customizations, the best way to learn this is by example. Let's start with the code in this script to build a simple line plot. It's similar to the line plot we've created in the first video, but this time the year and pop lists contain more data, including projections until the year 2100, forecasted by the United Nations. If we run this script, we already get a pretty nice plot: it shows that the population explosion that's going on will have slowed down by the end of the century. But some things can be improved. First, it should be clearer which data we are displaying, especially to people who are seeing the graph for the first time. And second, the plot really needs to draw the attention to the population explosion. The first thing you always need to do is label your axes. Let's do this by adding the xlabel() and ylabel() functions. As inputs, we pass strings that should be placed alongside the axes. Make sure to call these functions before calling show(), otherwise your customizations will not be displayed. If we run the script again, this time the axes are annotated. We're also going to add a title to our plot, with the title function. We pass the actual title, 'World Population Projections', as an argument. And there's the title!
So, using xlabel, ylabel and title, we can give the reader more information about the data on the plot: now they can at least tell what the plot is about. To put the population growth in perspective, I want to have the y-axis start from zero. You can do this with the yticks() function. The first input is a list, in this example with the numbers zero up to ten, with intervals of 2. If we run this, the plot will change: the curve shifts up. Now it's clear that in 1950, there were already about 2.5 billion people on this planet. Next, to make it clear we're talking about billions, we can add a second argument to the yticks function, which is a list with the display names of the ticks. This list should have the same length as the first list. The tick 0 gets the name 0, the tick 2 gets the name 2B, the tick 4 gets the name 4B and so on. By the way, B stands for billions here. If we run this version of the script, the labels will change accordingly, great. Finally, let's add some more historical data to accentuate the population explosion in the last 60 years. On Wikipedia, I found the world population data for the years 1800, 1850 and 1900. I can write them in list form and append them to the pop and year lists with the plus sign. If I now run the script once more, three datapoints are added to the graph, giving a more complete picture. Now that's how you turn an average line plot into a visual that has a clear story to tell! Over to you now. Head over to the exercises, gradually customize the world development chart and become the next Hans Rosling!
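The customization steps above can be sketched in one short script. The data points here are rough, hypothetical values standing in for the UN projections used in the video, and the script saves to a file instead of calling plt.show() so it runs non-interactively:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this in an interactive session
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the UN world-population projections.
year = [1800, 1850, 1900] + [1950, 2000, 2050, 2100]
pop = [1.0, 1.2, 1.6] + [2.5, 6.1, 9.7, 10.9]  # in billions

plt.plot(year, pop)
plt.xlabel("Year")
plt.ylabel("Population")
plt.title("World Population Projections")

# Start the y-axis at zero and label the ticks in billions.
plt.yticks([0, 2, 4, 6, 8, 10],
           ["0", "2B", "4B", "6B", "8B", "10B"])

plt.savefig("projections.png")  # use plt.show() instead when working interactively
```

Note how the historical values are prepended with the plus sign, exactly the list-concatenation trick mentioned above.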
Learn more about machine learning with R: https://www.datacamp.com/courses/machine-learning-toolbox Hi! I'm Zach Deane Mayer, and I'm one of the co-authors of the caret package. I have a passion for data science, and spend most of my time working on and thinking about problems in machine learning. This course focuses on predictive, rather than explanatory, modeling. We want models that do not overfit the training data and generalize well. In other words, our primary concern when modeling is "do the models perform well on new data?" The best way to answer this question is to test the models on new data. This simulates real world experience, in which you fit on one dataset, and then predict on new data, where you do not actually know the outcome. Simulating this experience with a train/test split helps you make an honest assessment of yourself as a modeler. This is one of the key insights of machine learning: error metrics should be computed on new data, because in-sample validation (or predicting on your training data) essentially guarantees overfitting. Out-of-sample validation helps you choose models that will continue to perform well in the future. This is the primary goal of the caret package in general and this course specifically: don’t overfit. Pick models that perform well on new data. Let's walk through a simple example of out-of-sample validation: We start with a linear regression model, fit on the first 20 rows of the mtcars dataset. Next, we make predictions with this model on a NEW dataset: the last 12 observations of the mtcars dataset. The 12 cars in this test set will not be used to determine the coefficients of the linear regression model, and are therefore a good test of how well we can predict on new data. In practice, rather than manually splitting the dataset, we'd actually use the createResample or createFolds function in caret, but the manual split simplifies this example.
Finally, we calculate root-mean-squared error (or RMSE) on the test set by comparing the predictions from our model to the actual MPG values for the test set. RMSE is a measure of the model's average error. It has the same units as the test set, so this means our model is off by 5 to 6 miles per gallon, on average. Compared to in-sample RMSE from a model fit on the full dataset, our model is significantly worse. If we had used in-sample error, we would have fooled ourselves into thinking our model is much better than it actually is. It's hard to make predictions on new data, as this example shows. Out-of-sample error helps account for this fact, so we can focus on models that predict things we don't already know. Let's practice this concept on some example data.
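The walkthrough above can be sketched entirely in base R (no caret needed), using hp as a single predictor purely for illustration — the course's actual model formula may differ:

```r
# Manual train/test split of mtcars, as described above.
train <- mtcars[1:20, ]   # fit on the first 20 rows ...
test  <- mtcars[21:32, ]  # ... and hold out the last 12 as "new" data

model <- lm(mpg ~ hp, data = train)
pred  <- predict(model, newdata = test)  # test rows never touched the fit

# Out-of-sample RMSE: average prediction error, in miles per gallon.
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

Comparing this value against the in-sample RMSE of a model fit on all 32 rows shows the gap that in-sample validation hides.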
Course introduction video to Object-Oriented Programming in R: S3 & R6 by Richie Cotton. Learn more about the course here: https://www.datacamp.com/courses/object-oriented-programming-in-r-s3-and-r6 Object-oriented programming (OOP) lets you specify relationships between functions and the objects that they can act on, helping you manage complexity in your code. This is an intermediate level course, providing an introduction to OOP, using the S3 and R6 systems. S3 is a great day-to-day R programming tool that simplifies some of the functions that you write. R6 is especially useful for industry-specific analyses, working with web APIs, and building GUIs. The course concludes with an interview with Winston Chang, creator of the R6 package.
Learn more about scoping with R: https://www.datacamp.com/courses/writing-functions-in-r Scoping describes how R looks up values when given a name. It's important to understand scoping so that you can reason about functions without running them. If I assign the value 10 to the name x, then ask for x, scoping describes the process that R uses to find the value 10. Inside functions, scoping works as you might expect. When this function is called, the function begins execution in a new working environment. In this new environment, x and y are defined, then put in a vector and returned. Unsurprisingly, the return value is the vector c(1, 2). If a variable referred to inside a function doesn't exist in the function's current environment, R looks in the environment one level up. In this function g(), y is defined, but x is not. When g is called, and execution reaches the line that combines x and y in a vector, y is found locally, and takes the value 1; x is not found locally, so R looks for it in the environment one level up, the global environment in this case, and finds the value 2. If x didn't exist in the global environment, the function would throw an error, since x isn't defined locally, or at any higher level. Scoping describes only where, not when, to look for a variable. This means it's possible the return value of a function could depend on when you call it. For example, here, depending on the state of our global environment (whether x has the value 15 or 20), the call to f() with no arguments returns different values. This is really undesirable behavior for a function, because if we look at a call to f() in isolation, we don't know what it will return. For this reason, the functions you write should never depend on variables other than their arguments. We'll talk more about this in Chapter 5 on robust functions. Lookup by name works exactly the same when the name refers to a function.
Here, when the function m() is called, and it reaches the line that calls l() with the value 10, R uses the l() function defined locally as x * 2 and returns 20. If it is obvious you are using a name like a function, R ignores any non-function objects when it looks it up. Here's a really tricky example: c is being used in three ways. First, as a function, and R correctly finds the c function that combines values into a vector. Second, c is being used as a name, and finally c refers to a value, which R looks up and finds is 3. Every time a function is called, it gets a clean working environment. This means different calls to the same function are completely independent. Here's a function, j(), that creates the object a and gives it the value 1 if it doesn't exist. Otherwise, it gives it the value a + 1. Finally, it returns the current value of a. Regardless of how many times you call j(), it always returns 1, since each time it's called the working environment is empty; a is created in this working environment, but the environment disappears as soon as j() exits. This also means any local variables created in a function are never available in the global environment. In summary, when you call a function, a new working environment is created to conduct the execution of the function's body. This new environment is first populated with the values of the arguments. As the function executes, values are looked for first in this working environment. If they aren't found, they are looked for in the environment the function was created in.
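Python's scoping is a rough analog of R's (lexical lookup one level up, and a fresh local environment per call), so the g() and j() examples above can be sketched like this; the names mirror the transcript rather than any real course code:

```python
x = 2  # defined in the global environment

def g():
    y = 1
    # y is found locally; x is not, so lookup moves one level up
    # to the global environment, where x is 2.
    return [x, y]

def j():
    # a is created fresh in the local environment on every call,
    # so j() returns 1 no matter how many times it runs -- the
    # local environment disappears when the function exits.
    a = 1
    return a

print(g())        # [2, 1]
print(j(), j())   # 1 1
```

As in R, nothing j() creates locally ever leaks into the global environment, which is why repeated calls are completely independent.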
Welcome to the DataCamp series on data visualisation with ggplot2! You can see the full course at https://www.datacamp.com/courses/data-visualization-with-ggplot2-part-1 My name is Rick Scavetta and I'll be the instructor for this series. For the past four years I've been training scientists in a variety of sub-disciplines on how to better understand and use visualisations, and I'm very excited to provide this course at DataCamp. So what is data vis? Data visualisation is an essential component of your skill set as a data scientist. Data vis is statistics and design combined in meaningful and appropriate ways. That means that on the one hand, data vis is a form of graphical data analysis, emphasising accurate representation and interpretation of data. But on the other hand, data vis relies on good design principles to not only make our plots attractive, but also meaningful. By meaningful I mean that good design aids in both the understanding and communication of results. On top of that, there is an element of creativity, since at its heart data vis is a form of visual communication. It's important to understand the distinction between exploratory and explanatory plots. Exploratory visualisations are easily generated, data-heavy and intended for a small specialist audience, for example yourself and your colleagues - their primary purpose is graphical data analysis. Explanatory visualisations are labour-intensive, data-specific and intended for a broader audience, e.g. in publications or presentations - they are part of the communications process. As a data analyst, your job involves exploring your data, but also explaining it to a specific audience. Good design begins with thinking about the audience - sometimes that just means ourselves. Let's go through a short example. Consider this data set containing the average brain and body weights of 62 land mammals from the MASS package.
We're interested in looking at the relationship between these two continuous variables, so the most obvious first step is to make a scatter plot, like this one. So we begin to explore our data, which reveals an expected positive skew on both axes. This isn't surprising since there are two mammals, the African and Asian elephants, with both very large brain and body weights. We can extend our plot by applying a linear model, but given the nature of the data, you can probably already imagine that this is a pretty poor model because a few extreme values have a large influence. A log transformation of both variables allows for a better fit. Although we began with a rough exploratory plot, it informed us about our data and led us to a meaningful result. Now we're ready to share our results as an explanatory plot. Don't worry if you don't understand all of this code at the moment; by the end of this series we'll have covered all the concepts used here. Another example of the usefulness of data visualisation as a data analysis tool is this classic example from Francis Anscombe, first published in 1973. When we imagine a linear model, as presented on this anonymous plot, we imagine that we are describing data that looks something like this, which would be a fairly accurate representation. However, this same model could be describing a very different set of data, for example one showing a parabolic relationship, where this model would be much better suited to the data at hand. Or it may be describing data in which an extreme value has a large effect, which becomes clear when the outlier is removed. And sometimes the model may be describing a relationship where in fact there is none at all, because of some obscure extreme values, which may very well be false. Here four different data sets are described by the same linear model. If we relied solely on the numerical output without plotting our data, we'd have missed distinct and interesting underlying trends.
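Anscombe's point can be checked numerically: the four data sets below are his published 1973 quartet, and a least-squares fit (written out in plain Python here, rather than R's lm()) gives essentially the same slope, about 0.5, for all four, even though the scatter plots look completely different.

```python
def lm_slope(x, y):
    """Least-squares slope of y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Anscombe's quartet: sets I-III share x; set IV has its own x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

slopes = [lm_slope(x123, y1), lm_slope(x123, y2),
          lm_slope(x123, y3), lm_slope(x4, y4)]
print([round(s, 3) for s in slopes])  # all close to 0.5
```

Identical numerical summaries, four very different pictures: this is exactly why plotting the data first matters.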
These examples should give you an idea of what we set out to do with visualisations. Although it's clearly based in statistics and graphical data analysis, visualisation is a creative process that involves some amount of trial and error. In this series of courses we're going to see some familiar data sets, such as the classic iris data set. We're going to understand how to explore our data from many different perspectives and use visual tools like colour appropriately. We'll also understand how the structure of our data helps us to make meaningful comparisons. We'll also use a variety of data sets built into R, such as the Vocab data frame in the car package, to understand common pitfalls and best practices, such as which plot type best represents the nature of our data.
Learn how to create histograms with matplotlib: https://www.datacamp.com/courses/intermediate-python-for-data-science In this video, I'll introduce the histogram. The histogram is a type of visualization that's very useful to explore your data. It can help you to get an idea about the distribution of your variables. To see how it works, imagine 12 values between 0 and 6. I've put them along a number line here. To build a histogram for these values, you can divide the line into equal chunks, called bins. Suppose you go for 3 bins, that each have a width of 2. Next, you count how many data points sit inside each bin. There are 4 data points in the first bin, 6 in the second bin and 2 in the third bin. Finally, you draw a bar for each bin. The height of the bar corresponds to the number of data points that fall in this bin. The result is a histogram, which gives us a nice overview of how the 12 values are distributed. Most values are in the middle, but there are more values below 2 than there are values above 4. Of course, matplotlib can also build histograms. As before, you should start by importing the pyplot package that's inside matplotlib. Next, you can use the hist() function. Let's open up its documentation. There's a bunch of arguments you can specify, but the first two here are the most important ones. x should be a list of values you want to build a histogram for. You can use the second argument, bins, to tell Python how many bins the data should be divided into. Based on this number, hist() will automatically find appropriate boundaries for all bins, and calculate how many values are in each one. If you don't specify the bins argument, it will be 10 by default. So to generate the histogram that you've seen before, let's start by building a list with the 12 values. Next, you simply call hist() and pass this list as an input, so it's matched to the argument x. I also specified the bins argument to be 3, so that the values are divided into three bins.
If you finally call the show function, a nice histogram results. Histograms are really useful to give a bigger picture. As an example, have a look at this so-called population pyramid. The age distribution is shown, for both males and females, in the European Union. Notice that the histograms are flipped 90 degrees; the bins are horizontal now. The bins are largest for the ages 40 to 44, where there are 20 million males and 20 million females. They are the so-called baby boomers. These are figures from the year 2010. What do you think will have changed in 2050? Let's have a look. The distribution is flatter, and the baby boom generation has gotten older. In the blink of an eye, you can easily see how demographics will be changing over time. That's the true power of histograms at work here! Now head over to the exercises to experiment with histograms yourself!
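The binning procedure the video describes (divide the range into equal-width bins, count the points in each) can be sketched in plain Python; the 12 values below are made up so the counts come out to 4, 6 and 2, as in the example:

```python
# Twelve made-up values between 0 and 6.
values = [0.5, 1.0, 1.3, 1.8, 2.2, 2.7, 3.1, 3.4, 3.6, 3.9, 4.5, 5.2]

def histogram(data, n_bins, lo, hi):
    """Count how many data points fall into each of n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in data:
        # Points on the top edge belong to the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

print(histogram(values, 3, 0, 6))  # [4, 6, 2]
```

In practice you would simply call plt.hist(values, bins=3) and let matplotlib do the counting and draw the bars for you.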
Learn more about ggplot2 layers in R: https://www.datacamp.com/courses/data-visualization-with-ggplot2-part-2 Now that we have some idea about the different grammatical elements of graphics, let's see how this works in practice. The grammar of graphics is implemented in R using the ggplot2 package, which was one of the first packages developed by the prolific statistician and R programmer Hadley Wickham. Essentially, we construct plots by layering grammatical elements on top of each other and use aesthetic mappings to define our visualisations. We are going to go through each grammatical element in depth in this and the next course. Here I'll introduce a data set which will be used throughout the videos, and we'll go over some simple examples. The first layer is data. Obviously we need some data to plot. I'm going to use several different data sets in the course videos, one of which is the classic iris data set collected by Edgar Anderson in the 1930s and thereafter popularised by RA Fisher. The data set contains information on three iris species: setosa, virginica and versicolor. Four measurements were taken from each plant - the petal length and width and the sepal length and width. You're probably familiar with petals; they're the colourful part of a flower. Sepals are the outer leaves of the flower; they are typically green, but in this case they are colourful. There are 50 specimens of each species. The data is stored in an object called iris; there are five variables: the species and one for each of the properties which were measured. The next layer is aesthetics, which tells us which scales we should map our data onto. This is where the second main component of the grammar of graphics comes into play. On top of layering the grammatical elements, it's here that we establish our aesthetic mappings. In this case we are going to make a scatter plot, so we're going to map the Sepal.Length onto the X aesthetic and the Sepal.Width onto the Y aesthetic.
The third essential layer, geometries, allows us to choose the geometry, that is, how the plot will look. After we've established our three essential layers, we have enough instructions to make a basic scatter plot. It's pretty rough, so to get a more meaningful and cleaner visualisation, we'll have to use the other layers. The next layer we'll use is facets, which dictate how to split up our plot. In this case we want to make three separate plots, one for each of the three species under consideration. The statistics layer can be used to calculate and add many different parameters. For example, here we've chosen to add a linear model to each of the three subplots. Next comes the coordinates layer, which allows us to specify the precise dimensions of the plot. Here we've cleaned up the labelling and the scaling of both the x and y axes. And finally the theme layer controls all the non-data ink on our plot, which allows us to get a nice-looking, meaningful and publication-quality plot directly in R. Let's explore these concepts further in the exercises.
Understand how to customize your plots in R and make them fancier. Join DataCamp today, and start our interactive intro to R programming tutorial for free: https://www.datacamp.com/courses/free-introduction-to-r The plots that I've shown to you in the previous video and the ones you've created yourself in the interactive exercises look quite nice, but there are certainly more things we can do to make the plots fancier and more informative. What about setting a title, or specifying the labels of the axes? All of this is possible from inside R. Throughout our experiments, we will be using the `mercury` data frame, that lists pressure versus temperature measurements of mercury. It contains two variables: temperature and pressure. Let's start with a simple plot: We can clearly see that the pressure rises dramatically once the temperature exceeds 200 degrees Celsius. But this plot is still kind of dull, isn't it? Have a look at this code, which specifies a bunch of arguments inside the `plot()` function. The result looks like this. Can you tell which arguments led to which changes in the plot? `xlab` and `ylab` changed the horizontal and vertical axis labels, respectively, while `main` specified the plot title. If you set the `type` argument to `o`, you will have both points and a line through these points on your plot. If you only want a line, you can use type = "l", which looks like this: Finally, the `col` argument specifies the plot color. Most of the arguments that are used here, such as `xlab`, `ylab`, `main` and `type`, are specified in the documentation of the `plot()` function. However, the `plot()` function also allows you to set a bunch of other graphical parameters. An example of such a graphical parameter is `col`, which specifies the color. But there are many others. You can specify these graphical parameters straight inside the `plot()` function, as you did with `col`.
In this case, the graphical parameter only has an effect on this specific graph. If you now plot the same graph without the col argument, the green color is not there anymore. You can also inspect and control these same graphical parameters with the `par()` function. Typing question mark par opens up its documentation, with information on all the parameters that you can specify. Simply calling `par()` gives you the actual values of these parameters. You can also use `par()` with arguments to specify session-wide graphical parameters: Suppose you set the color to blue using the `par()` function, ..., and now create a plot. It's blue. If you next create another plot, ..., the plot is still blue! That's because parameters specified with `par()` are maintained across different plotting operations. If you list all graphical parameters again and select the `col` element, ..., you'll see that indeed, it's still set to blue. For the rest of this video, let me focus on some of the most important graphical parameters. I'll do this by adding arguments to the plot function, instead of using the par function. You already know the first 5 graphical parameters. Similar to the `col` argument, the `col.main` argument specifies the color of the main title. There are also other col dot arguments to set the color of other elements in your plot. Next, the `cex dot axis` argument specifies the ratio by which the original font size of the axis tick marks should be multiplied. With cex dot axis equal to 0.6, we have small labels; with cex.axis set to 1.5, the labels become huge. Just as the col parameter has col dot variants for other elements in the plot, the cex parameter also has its cex dot variants. The `lty` argument specifies the line type. A line type of 1 is a full line, and the types 2 to 6 are all different types of lines, as you can see here. And last but not least, we have the `pch` argument, which specifies a plot symbol for the points you are plotting.
There are more than 35 different symbols for plotting, going from pluses and small octagons to stars and hashtags. Like I said, all of these arguments are just the tip of the iceberg. One of R's main powers is visualization, and this is clear from the numerous ways in which you can make your plots ready for a report. Just make sure you don't overdo things; interpretability should be the main goal at all times!
Learn the basics of Machine Learning with R. Start our Machine Learning Course for free: https://www.datacamp.com/courses/introduction-to-machine-learning-with-R In the previous video, you learned about three machine learning techniques: Classification, Regression and Clustering. As you might have felt, there are quite a few similarities between Classification and Regression. For both, you try to find a function, or a model, which can later be used to predict labels or values for unseen observations. It is important that during the training of the function, labeled observations are available to the algorithm. We call these techniques supervised learning. Labeling can be tedious work and is often done by puny humans. There are other techniques which don't require labeled observations to learn from data. These techniques are called unsupervised learning. You've already acquainted yourself with one of these techniques in the previous video, namely Clustering. Clustering will find groups of observations that are similar, and thus does not require specific labeled observations. In the next chapter we'll talk about assessing the performance of your trained model. In supervised learning, we can use the real labels of the observations and compare them with the labels we predicted. It's quite straightforward that you want your model's predictions to be as similar as possible to the real labels. With unsupervised learning, however, measuring the performance gets more difficult: we don't have real labels to compare anything to. You'll learn some neat techniques to assess the quality of a clustering in the next chapter. As you get more experienced as a data scientist, you might notice that things aren't always black and white. In machine learning, some techniques overlap between supervised and unsupervised learning. With semi-supervised learning, for example, you can have a lot of observations which are not labeled, and a few which are.
You can then first perform clustering to group all observations which are similar. Afterwards, you can use information about the clusters and about the few labeled observations to assign a class to unlabeled observations. This will give you more labeled observations to perform supervised learning on. Enough talking, let's do some more exercises!
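A minimal sketch of that semi-supervised idea, in plain Python with made-up 1-D data: cluster the points first (here, by nearest of two centroids), then spread the few known labels to every point in the same cluster.

```python
# Tiny made-up example: six 1-D observations, only two of which carry labels.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = {0: "A", 3: "B"}   # index -> known label

# Step 1 (unsupervised): assign each point to the nearest of two centroids.
centroids = [1.0, 5.0]
cluster = [0 if abs(p - centroids[0]) < abs(p - centroids[1]) else 1
           for p in points]

# Step 2: each cluster inherits the label of its labeled member,
# and that label is propagated to every unlabeled point in the cluster.
cluster_label = {cluster[i]: lab for i, lab in labels.items()}
predicted = [cluster_label[c] for c in cluster]
print(predicted)  # ['A', 'A', 'A', 'B', 'B', 'B']
```

After this step all six observations are labeled, so supervised learning can proceed on the full set, which is exactly the workflow described above.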
Learn more about dplyr: https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial If you've never missed a flight, you're spending too much time at the airport. At least, that's what a Nobel winning economist once said. I'd hate to miss a flight -- but I'd also hate to spend too much time at an airport. The hflights data set can help you avoid both. The data set records every flight that departed from Houston, TX in 2011. And that's no mean feat. In 2011, Houston was a major hub for at least three carriers. Continental Airlines, Express Jet, and Southwest Airlines. Many other airlines flew there too. Hidden in the hflights data set are clues about which airlines are the most reliable. For example, you'll see which had the longest delays and which had the most cancellations. The data is saved as hflights, a data frame that R loaded along with the hflights package. You can look at the data directly by typing hflights at the command line, but that might be a bittersweet experience. R will try to show you the entire data set, which is a quarter million rows long. Eventually R will give up, but not before filling your console -- and its display buffer with data, R's version of a data deluge... dplyr can help you look at the dataset. It provides a new data structure for R, the tbl. A tbl is just a special type of data frame, but R knows how to display it properly. To turn hflights into a tbl, you can run tbl_df on it. Now R only displays the amount of data that will fit in your console window. Not only does R cut out superfluous rows, it cuts out superfluous columns -- so you don't get that disorienting wrap around effect. The tbl display also tells you the dimensions of the full data set, as well as the names and data types of each column that is not shown. The best thing about this display is that it adapts to your window size. 
If you change the size of your console window and rerun hflights, R will display a different portion of the data -- whichever portion fills your new window size. If your data set is small enough, R will display the whole thing, even if it fills slightly more than one window. If you'd like to see a more complete display of the tbl data, you can use the function glimpse. Glimpse shows you the data types and the initial values of each column in the data set. If you don't like the tbl format, you can change your structure back with something like as.data.frame. I'll use the tbl format throughout this course, and I suggest that you do too. tbls do more than make your data easy to look at; they also make it easier to work with.
Part 1 of our course Merging DataFrames with pandas, taught by Dhavide Aruliah. Interested in learning more? Check out Dhavide's course here: https://www.datacamp.com/courses/merging-dataframes-with-pandas As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions. In this chapter, you'll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you'll need for merging DataFrames later in the course. Welcome to "Merging DataFrames with Pandas". My name is Dhavide Aruliah. I'm an applied mathematician and data scientist. This course is all about merging and combining DataFrames for your data science needs. Your data rarely exists as DataFrames from the outset: you generally have to deal with text files, spreadsheets, and databases. Let's first check out how to read multiple files into a collection of DataFrames. The primary tool we've used for data import is read_csv(). This function accepts the filepath of a comma-separated values file as input and returns a Pandas DataFrame directly. read_csv() has about fifty optional calling parameters permitting very fine-tuned data import. Pandas has other convenient tools (with similar default calling syntax) that import various data formats like Excel, HTML, or JSON into DataFrames.
To read multiple files using Pandas, we generally need separate DataFrames. For example, here we call pd.read_csv() twice to read two CSV files---sales-jan-2015.csv & sales-feb-2015.csv---into two distinct DataFrames. It's generally more efficient to iterate over a collection of file names. With that goal, we can create a list filenames with the two filepaths from before. We then initialize an empty list called dataframes and iterate through the list filenames. Within each iteration, we invoke read_csv() to read a DataFrame from a file and we append the resulting DataFrame to the list dataframes. We can also do the preceding computation with a list comprehension. Comprehensions are a convenient Python construction for exactly this kind of loop where an empty list is appended to within each iteration. You can check out DataCamp's Python programming courses for more details on comprehensions. When many filenames have a similar pattern, the glob module from the Python Standard Library is very useful. Here, we start by importing the function glob() from the built-in glob module. We use the pattern sales asterisk dot csv to match any strings that start with prefix sales and end with the suffix dot csv. The asterisk is a wildcard that matches zero or more standard characters. The function glob() uses the wildcard pattern to create an iterable object filenames containing all matching filenames in the current directory. Finally, the iterable filenames is consumed in a list comprehension that makes a list called dataframes containing the relevant data structures. Now it's your turn to practice reading multiple files into DataFrames.
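The glob-then-comprehension pattern can be sketched with only the standard library (the csv module stands in for pd.read_csv(), and the two files are created on the fly so the example is self-contained):

```python
import csv
import glob
import os
import tempfile

# Create two small CSV files matching the pattern 'sales*.csv'
# (stand-ins for sales-jan-2015.csv and sales-feb-2015.csv).
tmpdir = tempfile.mkdtemp()
for name, rows in [("sales-jan-2015.csv", [["units", "revenue"], ["10", "100"]]),
                   ("sales-feb-2015.csv", [["units", "revenue"], ["12", "120"]])]:
    with open(os.path.join(tmpdir, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)

# The asterisk is a wildcard matching zero or more characters,
# so this pattern picks up every file named sales...csv.
filenames = sorted(glob.glob(os.path.join(tmpdir, "sales*.csv")))

# The same list comprehension shown in the transcript, with the
# stdlib csv reader in place of read_csv().
tables = [list(csv.reader(open(fn))) for fn in filenames]

print(len(tables))    # 2
print(tables[0][0])   # ['units', 'revenue'] -- the header row
```

With pandas installed, the comprehension body would simply be pd.read_csv(fn) and the result a list of DataFrames rather than lists of rows.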
This is the 3rd video of chapter 1 of Network Analysis by Eric Ma. Take Eric's course: https://www.datacamp.com/courses/network-analysis-in-python-part-1 From online social networks such as Facebook and Twitter to transportation networks such as bike sharing systems, networks are everywhere, and knowing how to analyze this type of data will open up a new world of possibilities for you as a Data Scientist. This course will equip you with the skills to analyze, visualize, and make sense of networks. You'll apply the concepts you learn to real-world network data using the powerful NetworkX library. With the knowledge gained in this course, you'll develop your network thinking skills and be able to start looking at your data with a fresh perspective! Transcript: You may have seen node-link diagrams involving more than a hundred thousand nodes. They purport to show a visual representation of the network, but in reality just show a hairball. In this section, we are going to look at alternate ways of visualizing network data that are much more rational. I’m going to introduce to you three different types of network visualizations. The first is visualizing a network using a Matrix Plot. The second is what we call an “Arc Plot”, and the third is called “Circos Plot”. Let’s start first with a Matrix Plot. In a Matrix Plot, nodes are the rows and columns of a matrix, and cells are filled in according to whether an edge exists between the pairs of nodes. On these slides, the left matrix is the matrix plot of the graph on the right. In an undirected graph, the matrix is symmetrical around the diagonal, which I’ve highlighted in grey. I’ve also highlighted one edge in the toy graph, edge (A, B), which is equivalent to the edge (B, A). Likewise for edge (A, C), it is equivalent to the edge (C, A), because there’s no directionality associated with it. If the graph were a directed graph, then the matrix representation is not necessarily going to be symmetrical. 
In this example, we have a bidirectional edge between A and C, but only an edge from A to B and not B to A. Thus, we will have (A, B) filled in, but not (B, A). If the nodes are ordered along the rows and columns such that neighbours are listed close to one another, then a matrix plot can be used to visualize clusters, or communities, of nodes. Let's now move on to Arc Plots. An Arc Plot is a transformation of the node-link diagram layout, in which nodes are ordered along one axis of the plot, and edges are drawn using circular arcs from one node to another. If the nodes are ordered according to some sortable rule, e.g. age in a social network of users, or otherwise grouped together, e.g. by geographic location on a map for a transportation network, then it will be possible to visualize the relationship between connectivity and the sorted (or grouped) property. Arc Plots are a good starting point for visualizing a network, as they form the basis of the later plots that we'll take a look at. Let's now move on to Circos Plots. A Circos Plot is a transformation of the Arc Plot, such that the two ends of the Arc Plot are joined together into a circle. Circos Plots were originally designed for use in genomics, and you can think of them as an aesthetic and compact alternative to Arc Plots. You will be using a plotting utility that I developed called nxviz. Here's how to use it. Suppose we had a graph G in which we added nodes and edges. To visualize it using nxviz, we first need to import nxviz as nv, and import matplotlib to make sure that we can show the plot later. Next, we instantiate a new nv.ArcPlot() object, and pass in the graph G. We can also order nodes by the values keyed on some "key". Finally, we can call the draw() function, and as always, we also call plt.show(). The code example here shows you how to create an Arc Plot using nxviz, and you'll get a chance to play around with the other plots in the exercises. Alright! Let's get hacking!
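The matrix-plot idea described above is easy to verify in plain Python (no NetworkX needed for this sketch): build the adjacency matrix of the undirected toy graph with edges (A, B) and (A, C), and check that it is symmetric about the diagonal.

```python
# Toy undirected graph from the transcript: edges (A, B) and (A, C).
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C")]

idx = {name: i for i, name in enumerate(nodes)}
n = len(nodes)
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    # An undirected edge fills in both (u, v) and (v, u).
    matrix[idx[u]][idx[v]] = 1
    matrix[idx[v]][idx[u]] = 1

# The matrix plot of an undirected graph is symmetric about the diagonal.
assert all(matrix[i][j] == matrix[j][i]
           for i in range(n) for j in range(n))
for row in matrix:
    print(row)
```

For a directed graph you would fill in only matrix[idx[u]][idx[v]] for each edge, and the symmetry property would no longer be guaranteed, which is exactly the A-to-B-but-not-B-to-A case above.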
Learn about empirical cumulative distribution functions: https://www.datacamp.com/courses/statistical-thinking-in-python-part-1 We saw in the last video the clarity of bee swarm plots. However, there is a limit to their efficacy. For example, imagine we wanted to plot the county-level voting data for all states east of the Mississippi River and all states west. We make the swarm plot as before, but using a DataFrame that contains all states, with each classified as being east or west of the Mississippi. The bee swarm plot has a real problem. The edges have overlapping data points, which was necessary in order to fit all points onto the plot. We are now obfuscating data. So, using a bee swarm plot here is not the best option. As an alternative, we can compute an empirical cumulative distribution function, or ECDF. Again, this is best explained by example. Here is a picture of an ECDF of the percentage of swing state votes that went to Obama. An x-value of an ECDF is the quantity you are measuring, in this case the percent of the vote that went to Obama. The y-value is the fraction of data points that have a value smaller than the corresponding x-value. For example, 20% of counties in swing states had 36% or less of their people vote for Obama. Similarly, 75% of counties in swing states had 50% or less of their people vote for Obama. Let's look at how to make one of these from our data. The x-axis is the sorted data. We need to generate it using the NumPy function sort, so we need to import NumPy, which we do using the alias np, as is commonly done. Then we can use np.sort() to generate our x-data. The y-axis is evenly spaced data points with a maximum of one, which we can generate using the np.arange() function and then dividing by the total number of data points. Once we specify the x and y values, we plot the points. By default, plt.plot() plots lines connecting the data points. To plot our ECDF, we just want points. To achieve this we pass the string '.'
and the string 'none' to the keyword arguments marker and linestyle, respectively. As you remember from my forceful reminder in an earlier video, we label the axes. Finally, we use the plt.margins() function to make sure none of the data points run over the side of the plot area. Choosing a value of 0.02 gives a 2% buffer all around the plot. The result is the beautiful ECDF I just showed you. We can also easily plot multiple ECDFs on the same plot. For example, here are the ECDFs for the three swing states. We see that Ohio and Pennsylvania were similar, with Pennsylvania having slightly more Democratic counties. Florida, on the other hand, had a greater fraction of heavily Republican counties. In my workflow, I almost always plot the ECDF first. It shows all the data and gives a complete picture of how the data are distributed. But don't take my word for how great ECDFs are. You can see for yourself in the exercises!
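The np.sort/np.arange recipe can also be written in plain Python, which makes the ECDF definition explicit (the data below is a made-up stand-in for the county vote shares):

```python
# Made-up sample standing in for county-level vote percentages.
data = [36, 41, 43, 47, 50, 52, 55, 58, 60, 62]

# ECDF: x is the sorted data; y is the fraction of points <= each x.
x = sorted(data)
n = len(x)
y = [(i + 1) / n for i in range(n)]

# Reading the ECDF: what fraction of values are at or below 50?
frac_le_50 = max(yi for xi, yi in zip(x, y) if xi <= 50)
print(frac_le_50)  # 0.5
```

To plot it, you would pass these to plt.plot(x, y, marker='.', linestyle='none'), exactly as described above, so that only the points are drawn and no connecting lines.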